HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Personalize Job
Recommendation System
DUC-THAI DO
Thai.DD211260M@sis.hust.edu.vn
Major: Data Science and Artificial Intelligence
Thesis advisor: Dr. Tran Viet Trung
Department: Computer Science
Institute: School of Information and Communication Technology
Hanoi, 04-2023
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Personalize Job
Recommendation System
DUC-THAI DO
Thai.DD211260M@sis.hust.edu.vn
Major: Data Science and Artificial Intelligence
Thesis advisor: Dr. Tran Viet Trung
Signature of advisor
Department: Computer Science
Institute: School of Information and Communication Technology
Hanoi, 04-2023
6Ĉ+47%0 %DQKjQKOҫQ1 ngày 11/11/2014
&Ӝ1*+Ñ$;+Ӝ,&+Ӫ1*+Ƭ$9,ӊ71$0
ĈӝFOұS 7ӵGR +ҥQKSK~F
%Ҧ1;È&1+Ұ1&+ӌ1+6Ӱ$ /8Ұ19Ă17+Ҥ&6Ƭ
+ӑYjWrQWiFJLҧOXұQYăQĈӛĈӭF7KiL
ĈӅWjLOXұQYăQ&iQKkQKRiKӋWKӕQJJӧLêYLӋFOjP
Chuyên ngành: .KRDKӑFGӳOLӋXvà TUtWXӋQKkQWҥR (Elitech)
0mVӕ69: 20211260M
7iFJLҧ1JѭӡLKѭӟQJGүQNKRDKӑFYj+ӝLÿӗQJFKҩPOXұQYăQ[iFQKұQWiFJLҧÿm
VӱDFKӳDEәVXQJOXұQYăQWKHRELrQEҧQKӑS+ӝLÿӗQJQJj\ 22/04/2023 YӟLFiFQӝL
dung sau:
STT 1ӝLGXQJFKӍQKVӱD Trang
1 %ӓÿiQKVӕFKѭѫQJWURQJFKѭѫQJPӣÿҫXYjNӃWOXұQ 9, 69
2 %әVXQJOjPU}PөFWLrXFӫDOXұQYăQWURQJSKҫQ0ӣÿҫXYjOjPU}
KjPêFiQKkQKRi³SHUVRQDOL]LQJ´KӋWKӕQJWѭYҩQ
9
3 7URQJFKѭѫQJ EDQÿҫXEәVXQJOjPU}KѫQNKiLQLӋPFiFNӻWKXұW
“item popularity”, “user-item matching”
20-21
4 ĈѭDQӝLGXQJFKѭѫQJ EDQÿҫX WUuQKEj\GDWDVHWVYjFiFKRҥWÿӝQJ
FөWKӇFKXҭQKRiKDLGDWDVHWVQKѭPөFFRQFӫDFKѭѫQJ EDQÿҫXYj
FKӍQKWrQFKѭѫQJnày ÿӇOjPU}KѫQQӝLGXQJWUuQKEj\
27-39
5Trình bày mô KuQKWәQJWKӇFӫDEjLWRiQWѭYҩQYLӋFOjPYjOXұQJLҧL
FiFNƭWKXұWÿmOӵDFKӑQ/jPU}KѫQSLSHOLQHWKӇKLӋQSKѭѫQJSKiS
ÿmWLӃQKjQKWKӵFKLӋQ
27-28
6 0{WҧNƭKѫQFiFKWLӃSFұQVӱGөQJSKѭѫQJSKiSNӃWKӧSK\EULG 50
7 0{WҧNƭKѫQYLӋFÿiQKJLiKӋWѭYҩQYӟLWұS5HF6\V 61
8 6ӱGөQJWKrPÿӝÿR0$3# 60-67
9 ĈiQKJLiVRViQKNӃWTXҧWKӵFQJKLӋPYӟLF{QJEӕNKiFVӱGөQJFQJ
datasets
68
10 3KkQWtFKNƭKѫQNӃWTXҧFӫDP{KuQKGӵDWUrQEҧQFKҩWGӳOLӋX 61-68
11 %әVXQJFiFWUtFKGүQFzQWKLӃX 21-22
12 %әVXQJWLrXÿӅYjWrQFiFWUөFFӫDFiFELӇXÿӗ 32-65
13 5jVRiWKLӋXFKӍQKFiFOӛLVRҥQWKҧR
1Jj\WKiQJQăP 2023
*LiRYLrQKѭӟQJGүQ 7iFJLҧOXұQYăQ
&+Ӫ7ӎ&++Ӝ,ĈӖ1*
Graduation Thesis Assignment
Name: Duc-Thai Do
Phone: +84 902210496
Email: Thai.DD211260M@sis.hust.edu.vn; thai.dec1mo@gmail.com
Class: 21A-IT-KHDL-E/CH2021A
Affiliation: Hanoi University of Science and Technology
I - Duc-Thai Do - hereby warrant that the work and presentation in this thesis were
performed by myself under the supervision of Dr. Tran Viet Trung. All the results
presented in this thesis are truthful and are not copied from any other works. All
references in this thesis including images, tables, figures, and quotes are clearly and
fully documented in the bibliography. I take full responsibility for any copied content that violates university regulations.
Hanoi, April 2023
Author
Duc-Thai Do
ACKNOWLEDGMENTS
Before presenting the main content of the thesis, I would like to dedicate these lines
to send my most sincere thanks to the people who have helped and shaped the person
I am today.
I thank my parents and grandparents for raising, being with, and supporting me
unconditionally. They have given me a family that could not be more wonderful, a
place that always gives me motivation whenever I feel tired or stumble on the road of
life.
In order to complete this graduation project, I would like to express my sincerest
thanks to Mr. Tran Viet Trung, who not only suggested and gave new ideas, but also
closely guided and encouraged me to overcome this difficult period. Working with you has been one of the luckiest things to happen to me. Thanks to your encouragement and enthusiastic
guidance, I was able to pass and complete this graduation thesis. At the same time, I
would also like to thank all the teachers of Hanoi University of Science and Technology
for always doing their best for us. The teachers have brought us valuable knowledge
and experience that we can’t get anywhere else. I hope the teachers are always healthy
to continue educating the next generation of Bach Khoa students.
I would like to thank my loved one for always being there, cheering, and helping me
during my study and graduation thesis.
ABSTRACT
Nowadays, online recruitment websites have become one of the main channels for
people to search for jobs. These platforms have saved a lot of time and money for both
job seekers and recruiting organizations. However, traditional information retrieval
techniques, such as searching for a desired job by keyword using a search engine, are
no longer suitable: the number of results returned to job seekers can be very large, so
they need to spend considerable time reading and reviewing their options, resulting in
a tedious and difficult job search experience.
For that reason, this thesis aims at building an effective job recommendation system,
increasing the personalization and relevance of job search results, and making the user
experience and the job-searching journey easier and more engaging.
To achieve this, the thesis has analyzed data on job listings and on job seeker character-
istics and behavior from two labor market datasets, RecSys2016 and CareerBuilder2012,
and has researched and utilized a combination of techniques, tools, and models from the
fields of natural language processing, machine learning, and recommendation systems to
implement and experiment with various job recommendation algorithms: item popularity,
user-item matching, content-based, user-based collaborative filtering, and graph neural
networks on the two labor datasets.
The effectiveness of the system has been evaluated based on two metrics, MAP@K
and RSScore, showing practical and positive results for recommendation in the labor
market domain.
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Chapter 1. Theoretical basis ........................................... 16
1.1.Foundation algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.1.1. TF-IDF.................................................................. 16
1.1.2. Cosine Similarity......................................................... 17
1.1.3. K-nearest neighbors...................................................... 17
1.1.4. Overview of neural network............................................... 18
1.1.5. Overview of graph neural network........................................ 19
1.2.Overview of recommendation system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.2.1. Item popularity recommendation......................................... 21
1.2.2. User-item matching recommendation..................................... 22
1.2.3. Content-based recommendation system................................... 22
1.2.4. Collaborative filtering recommendation system............................ 23
1.2.5. Hybrid recommendation system .......................................... 25
1.3.Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 2. Proposed approaches for job recommendation . . . . . . . . . . . . . . . 28
2.1.Overall model construction of the job recommendation system . . 28
2.2.Dataset description and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.1. RecSys2016 dataset. . ..................................................... 29
2.2.2. CareerBuilder2012 dataset................................................ 40
2.3.Data labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.Recommendation model implementation . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.1. Item popularity approach................................................. 49
2.5.2. User-item matching approach............................................. 50
2.5.3. Content-based with item popularity approach............................. 51
2.5.4. Collaborative filtering with item popularity approach..................... 52
2.5.5. Graph Neural Network with item popularity approach.................... 53
Chapter 3. Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.1.Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.1.1. Map@k................................................................... 61
3.1.2. RecSys2016 Score . . ...................................................... 62
3.2.Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Conclusion and Future work ........................................... 69
List of Figures
1 Search results from careerbuilder.vn at 08/04/2023 ..............11
1.1 Neural network architecture ...........................19
1.2 Basic graph neural network illustration .....................19
1.3 Content-based recommendation system .....................23
1.4 User-based vs Item-based in memory-based collaborative filtering ......24
2.1 User’s university degree distribution in the RecSys2016 dataset ........33
2.2 User’s total experience year distribution in the RecSys2016 dataset .....34
2.3 User’s current experience year distribution in the RecSys2016 dataset ....34
2.4 User’s number of experience entry distribution in the RecSys2016 dataset . . 35
2.5 Job’s employment type distribution in the RecSys2016 dataset ........35
2.6 Job’s active during test status distribution in the RecSys2016 dataset ....36
2.7 User and job’s career level distribution in the RecSys2016 dataset ......36
2.8 Top 5 user and job’s discipline id distribution in the RecSys2016 dataset . . 37
2.9 Top 5 user and job’s industry id distribution in the RecSys2016 dataset . . . 38
2.10 User and job’s country distribution in the RecSys2016 dataset ........38
2.11 Interaction type distribution in the RecSys2016 dataset ............39
2.12 CareerBuilder2012 data layout ..........................40
2.13 User’s degree type distribution in the CareerBuilder dataset .........42
2.14 Top 20 user’s major distribution in the CareerBuilder dataset ........42
2.15 Top 20 user’s job distribution in the CareerBuilder dataset ..........42
2.16 User’s years from graduation distribution in the CareerBuilder dataset . . . 43
2.17 User’s total experience years ( 40) distribution in the CareerBuilder dataset 44
2.18 User’s CurrentlyEmployed status distribution in the CareerBuilder dataset . 44
2.19 User’s ManagedOthers status distribution in the CareerBuilder dataset . . . 45
2.20 Top 20 item’s job titles in the CareerBuilder dataset ..............45
2.21 User and item’s US state distribution in the CareerBuilder dataset ......46
2.22 Item’s active status distribution in the CareerBuilder dataset .........47
2.23 Popularity score calculation illustration .....................50
2.24 Simple collaborative filtering illustration ....................52
2.25 Bipartite graph demonstration in recommendation systems ..........53
3.1 RGCN’s loss curve during training in RecSys2016 ...............65
3.2 RGCN’s loss curve during training in CareerBuilder2012 ...........65
List of Tables
2.1 Table Users in the RecSys2016 dataset .....................31
2.2 Table Items in the RecSys2016 dataset .....................57
2.3 Table Interactions in the RecSys2016 dataset .................58
2.4 Table Impressions in the RecSys2016 dataset .................58
2.5 RecSys2016 dataset statistics ..........................58
2.6 Table users in CareerBuilder2012 dataset ...................59
2.7 Table user_history in CareerBuilder2012 dataset ...............59
2.8 Table jobs in CareerBuilder2012 dataset ....................59
2.9 Table apps in CareerBuilder2012 dataset ....................60
2.10 Table window_dates in CareerBuilder2012 dataset ..............60
2.11 Anonymized textual fields processing example .................60
3.1 Performance of Matching model with different weights in the Recsys2016
dataset.......................................63
3.2 Performance of Matching model with different weights in the Career-
Builder2012 dataset ................................64
3.3 Performance of Content-based model with different weights in the Rec-
sys2016 dataset ..................................66
3.4 Performance of Content-based model with different weights in the Ca-
reerBuilder2012 dataset ..............................66
3.5 Performance of different models in the RecSys2016 dataset ..........67
3.6 Performance of different models in the CareerBuilder2012 dataset ......67
Introduction
Preface
Online job portals such as careerbuilder.vn, timviecnhanh.com, and vietnamworks.com
have emerged as one of the primary platforms utilized by job seekers to facilitate the job
search process. These platforms have saved a lot of time and money for both candidates
as well as recruiting organizations. However, the traditional technique of retrieving
information through search engines by keywords, such as the "reverse index" [17], which
relies solely on keywords for document mapping, is no longer suitable: the number of
results returned could be large and unrelated to the candidate (not personalized), leading
to a poor user experience.
An illustration of the job search results generated by one of the above-mentioned
online recruitment platforms is presented in Figure 1. A search using the keyword
"java developer" within the "IT - Software" industry and the "Hanoi" location yielded
104 job postings. This is a large number of results, which are not personalized and are
shown identically to every job seeker, forcing each of them to spend a large amount of
time reading and weighing the options carefully. Since it is difficult to read every job
posting in the search results, the overall service quality is not optimal.
Since what is considered a good job posting for one person may not suit another,
personalization is a crucial factor when it comes to ranking job search results. In the
context of a job recommendation system - a vital
component that plays a significant role in the user experience of job portals, ”personal-
ize” refers to the process of tailoring job recommendations to the individual preferences,
skills, experience, and background of each job seeker. This means that the system takes
into account the unique characteristics of each user and provides them with job post-
ings that are most relevant and interesting to them. With this in mind, the goal of
the ”Personalize job recommendation system” thesis is to build an effective job
recommendation system that increases personalization and relevance in job search re-
sults. The thesis aims to improve the user experience by leveraging a recommendation
system that considers an individual’s personal profile to suggest job postings that are
the most relevant to them, making the job-searching journey easier, more efficient, and
more engaging, and leading to better job matches and ultimately more successful
outcomes for job seekers.
Figure 1: Search results from careerbuilder.vn at 08/04/2023
Challenges
Data challenges
Lack of available labor data. Any algorithm needs huge, objective data sets in
order to be properly effective. Millions of objective resumes and clear, accurate job
postings are required for recruiting. It takes years to gather this amount of objective
data. According to [7], ”The problem with today’s candidate matching technology is
not with machine learning technology. The issue is a lack of viable information with
the two elements being matched: candidate resumes and employer job descriptions.”
Available labor data, including user profiles, job listings, and job seeker behavior
may not be enough, particularly in a niche market or a specific industry. This data has
a high level of privacy since it contains a lot of information about users’ profiles and
activities. One does not always have access to these limited data sources or sufficient
data collection methods.
Semantic gap. E-recruitment recommendation systems suffer from a semantic gap
between contents from diverse sources, such as resumes and job descriptions, because
the textual data is created by different people. It is possible that the same concept is
referred to by different terms. Additionally, the same term's meaning might vary based on
the situation [10].
The first factor that hinders the success of the recommendation system is the fact
that the resumes are not objective. Resumes have served as the primary starting point
for evaluating candidates. But despite its good intentions, the resume is still highly
subjective. From the very start of working life, everyone is taught that a resume
must present them professionally to potential employers. Often, resumes include
hyped headlines, selected accomplishments, exaggerations or, more seriously, falsehoods.
In the end, a resume is really an advertisement for a person, with every detail being
used to best polish the candidate.
Next, resumes carry different skill descriptions. For example, “proficiency” in Mi-
crosoft Excel is different for each person. It might mean knowing how to sort data by
columns for one person and the ability to create advanced charts for another. It is not
possible to objectively determine a person’s qualifications based on a resume alone, and
some sort of assessment is required.
Finally, more and more candidates add attention-grabbing keywords to their resumes
(including exact keywords from job requirements) in the hope of tricking applicant tracking
systems or other keyword-matching systems. This adds noise and contributes
to false matching results.
The second hindrance to the success of a recommendation system is the poor quality
of most job descriptions. There is a certain difference between the actual job and the
description of the same job. This is mainly due to not prioritizing the time it takes
to write a good job description, and also because it is really hard. Usually, the job
description posted by the company is the one that has been around for a while or is
found on a certain website, with a little customization then added in the specific title,
manager, and recruiter. This causes important skills or requirements to be missed and
recruiting based on particular job characteristics becomes very difficult.
Why does this happen? Following the correct process, the job description requires
considerable thought and time from the job manager, who is often busy and doesn’t
prioritize this supposedly administrative work. As a result, job descriptions are usually
written by human resources. Sometimes, a job posting writer is responsible for writing
descriptions for many different positions. Job descriptions, therefore, are incomplete
leading to the application of unqualified candidates. In addition, the retention rate will
be higher if the candidate has a clearer picture of the job from the beginning.
Despite playing a large role in the recruitment process in particular and the devel-
opment of the company in general, many job descriptions lack important elements such
as a transparent job title, a clear overview and information about the job position,
a list of necessary skills and qualifications, as well as salary and benefits.
Technical challenges
Cold start problem. The cold start problem is one of the major challenges faced
by job recommendation systems. This problem arises when a new user or a new job
is added to the system, and there is no historical data available for the user or job.
In such cases, the system cannot make accurate recommendations based on the user’s
preferences or job requirements.
In the case of new users, the system may rely on the user’s profile information, such
as their education, work experience, and skills, to make initial job recommendations.
However, this approach may not be effective as the user’s preferences and interests may
not be reflected accurately in their profile. Additionally, the user may not have a clear
idea of what they are looking for in a job, which makes it challenging for the system to
provide relevant recommendations.
Similarly, in the case of new jobs, the system may utilize job features such as titles,
descriptions, and requirements,... for initial recommendations. Nevertheless, not having
enough data to accurately match the job with the appropriate candidates can result in
the job being recommended to the wrong candidates, which can be frustrating for both
the job seeker and the employer.
Skill extraction. Skills are the most crucial factor in matching job seekers with job
postings, even though many aspects may be implicit and need to be extracted using well-
planned approaches as well. To improve the efficacy of e-recruitment recommendation
systems, it is vital to use both expertise and skills of job seekers and the required
skills listed in the job postings. Thus, skill extraction from the textual data is another
challenging task in the e-recruitment recommendation systems [10].
Skill extraction from the textual data is essential for job recommendation systems
since candidates’ profiles and job descriptions are often available as free text with no
structure. Some NLP techniques have been employed such as n-gram tokenization,
NER, part-of-speech tagging (PoS tagging), skill dictionaries or ontology utilization,...
to extract skills from the text. Job seekers and job postings’ skills can then be further
processed using skill similarities or relations provided by word embedding models (e.g.,
word2vec), and by domain-specific ontologies or skill taxonomies. Our work on occupational
skill synonym prediction [5] is also researched and considered for this skill-processing
purpose.
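To make the dictionary-based direction above concrete, the following minimal sketch matches lowercased n-grams of a free-text profile against a small skill dictionary. The dictionary entries and the sample sentence are illustrative assumptions, not data or code from this thesis.

```python
# Minimal dictionary-based skill extraction sketch (hypothetical skill list).
SKILL_DICTIONARY = {"python", "machine learning", "sql", "project management"}
MAX_NGRAM = 3  # longest skill phrase present in the dictionary

def extract_skills(text: str) -> set[str]:
    # Lowercase, drop commas, and scan every n-gram up to MAX_NGRAM words.
    tokens = text.lower().replace(",", " ").split()
    found = set()
    for n in range(1, MAX_NGRAM + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in SKILL_DICTIONARY:
                found.add(phrase)
    return found

print(extract_skills("Experienced in Python, SQL and machine learning projects"))
# {'python', 'sql', 'machine learning'}
```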
Large scalability. Scalability is one of the key challenges in job recommendation
systems due to the ever-growing amounts of data. They need to handle huge amounts of
data as real-world job portals have to deal with millions of job seekers and job postings.
Thus, recommending at large scale needs to be considered in these platforms.
One of the main challenges in dealing with large-scale data is the need for efficient
data storage and retrieval. Storing and managing large amounts of data can be expensive
and time-consuming, and retrieving data quickly can be a challenge, especially when
dealing with complex data structures.
Another challenge is the need for scalable algorithms that can handle large amounts
of data. Traditional machine learning algorithms may not be able to handle the amount
of data required for job recommendation systems, as they may be too slow or too
memory-intensive. More specifically, problems with performance and memory/storage
usage may arise in both the training and inference phases when dealing with enormous
amounts of data.
To deal with the execution time and consumed storage/memory issues during the
training phase, a study from CareerBuilder [14] created an item-based graph of jobs
with edges representing job similarities based on behavioral and content-based features
instead of a user-based (job seeker based) or user-item (job-job seeker) graph for scala-
bility. The recommendations were generated on a subgraph of this job graph which was
selected by a job seeker’s resume or past clicks.
To deal with the response time in the inference phase, recommendation systems
commonly use a two-step approach. In the first step, a computationally inexpensive
model is utilized to select a pool of potential job postings from a large number of items.
The second phase then reranks the results using a more expensive model.
Additionally, as the amount of data grows, the need for real-time processing and
analysis becomes increasingly important. Users expect quick and accurate job recom-
mendations, which means that the system must be able to process large amounts of
data in real-time to provide personalized recommendations that are relevant to each
individual user.
Assumptions and problem definition
Assumptions
Due to the challenges discussed above, for simplicity without losing the essence of
the problem, some assumptions have been made as follows:
A user’s resume/profile is an accurate description of the candidate or job seeker.
A job posting is an accurate and complete description of the actual job.
Additionally, all data used in this thesis, including user profiles and job postings are
in English.
Problem definition
The problem definition of ”Personalizing the job recommendation system”
can be described specifically as follows:
Input:
The set U contains information about the users.
The set J contains information about the job postings.
The set I contains information about the interactions between users and job
postings.
Output: For each user, a ranked list of job postings drawn from J, sorted in
descending order of relevance.
Contributions
In this thesis, a combination of different techniques, tools, and models in the fields
of natural language processing, machine learning, and recommendation systems is re-
searched and utilized for job recommendation system construction. The author’s specific
contributions include:
Contribute to the overall understanding of the job market and job seeker behavior.
By analyzing data on job listings and job seeker behavior and profile from two
labor datasets: RecSys2016 and CareerBuilder2012, the thesis can provide valu-
able insights to recruiters, human resources, and organizations looking to better
understand the job market and improve their recruitment strategies.
Research, implement, and experiment with various job recommendation algo-
rithms: item popularity, user-item matching, content-based, user-based collabo-
rative filtering, and graph neural network on two above-mentioned labor datasets.
Evaluate the performance and usability of the different algorithms in the context
of job recommendation systems based on two metrics, MAP@K and RSScore,
which helps to give a more comprehensive understanding of the factors that
impact the effectiveness of job recommendation systems.
The thesis is structured into three chapters, plus introduction and conclusion parts:
Introduction serves as the introduction, providing reasons for choosing the topic,
challenges, assumptions, and contribution of the thesis.
Chapter 1. Theoretical basis covers the theoretical background of the study,
discussing the various techniques and algorithms used in recommender systems.
Chapter 2. Proposed approaches for job recommendation provides a
detailed description and analysis of the dataset and outlines the approaches used
in the study, including the preprocessing of data, and algorithm implementation
details.
Chapter 3. Experiments and results covers the experiments conducted and
evaluation metrics to evaluate the performance of the different algorithms.
Conclusion and Future work summarizes the findings of the study and presents
future research directions.
Chapter 1
Theoretical basis
1.1. Foundation algorithms
1.1.1. TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used tech-
nique to represent textual values as numerical vectors. It is a statistical measure that
determines the importance of a word in a document or corpus and is calculated as the
product of two factors: term frequency (TF) and inverse document frequency (IDF).
tf(t, d) = \frac{f(t, d)}{\max\{f(w, d) : w \in d\}} \qquad (1.1.1)
The term frequency measures the frequency of a word in a particular document and is
calculated as the number of times a word appears in the document divided by the total
number of words in the document. Equation 1.1.1 is the formula of TF, in which:
f (t, d) is the number of occurrences of the word t in the document d.
max{f(w, d) : w ∈ d} is the maximum number of occurrences of any word in the document d.
idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|} \qquad (1.1.2)
The inverse document frequency measures the rarity of a word in all documents and is
calculated as the logarithm of the total number of documents in the corpus divided by
the number of documents containing the word. Equation 1.1.2 is the formula of IDF,
in which:
|D| is the total number of documents.
|{d ∈ D : t ∈ d}| is the number of documents in D that contain the word t.
tfidf(t, d, D) = tf(t, d) \times idf(t, D) \qquad (1.1.3)
The TF-IDF score for a word in a specific document is obtained by multiplying its
term frequency with its inverse document frequency, which is described in Equation
1.1.3. This score represents the significance of the word in the document and is utilized
as a feature in the textual value representation. Words with high TF-IDF values are
words that appear more often in the document under consideration and rarely in other
documents. This reduces the influence of common words and increases the importance
of high-value words (keywords) in documents.
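The following minimal sketch implements Equations 1.1.1-1.1.3 directly on a toy corpus. It is only an illustration of the formulas; library implementations such as scikit-learn's TfidfVectorizer use slightly different normalization conventions.

```python
import math

def tf(term: str, doc: list[str]) -> float:
    # Equation 1.1.1: raw count normalized by the most frequent word in the document.
    counts = {w: doc.count(w) for w in doc}
    return counts.get(term, 0) / max(counts.values())

def idf(term: str, corpus: list[list[str]]) -> float:
    # Equation 1.1.2: log of total documents over documents containing the term.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tfidf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    # Equation 1.1.3: product of term frequency and inverse document frequency.
    return tf(term, doc) * idf(term, corpus)

corpus = [["java", "developer", "java"], ["python", "developer"], ["data", "engineer"]]
print(tfidf("java", corpus[0], corpus))       # high: frequent here, rare elsewhere
print(tfidf("developer", corpus[0], corpus))  # lower: appears in two documents
```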
1.1.2. Cosine Similarity
\cos(\theta) = \frac{A \cdot B}{\|A\|_2 \, \|B\|_2} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \, \sqrt{\sum_{i=1}^{n} (B_i)^2}} \qquad (1.1.4)
Given two vectors A and B, the cosine similarity, i.e. the cos(θ) of the angle between these
two vectors, is calculated from the dot product and the magnitudes of the vectors as in
Equation 1.1.4, in which A_i and B_i are respectively the elements of the vectors A and B.
The cos(θ) value is in the [-1, 1] range:
cos(θ) = 1 means the angle between the two vectors is 0 degrees and A and B are
exactly identical.
Conversely, cos(θ) = -1 means A and B are completely opposite.
cos(θ) = 0 indicates orthogonality between the two vectors, i.e. A and B are not
related.
The larger the value of cos(θ) (approaching 1), the more similar the two vectors A
and B are; conversely, the smaller it is (approaching -1), the more different the two
vectors are.
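A direct implementation of Equation 1.1.4, shown here as a minimal sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Equation 1.1.4: dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```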
1.1.3. K-nearest neighbors
K-nearest neighbors (KNN) is a simple yet powerful non-parametric machine learn-
ing algorithm used for both regression and classification problems. The KNN algorithm
is based on the assumption that similar data points are close to each other in the feature
space.
The KNN algorithm works by first choosing a value for K, which represents the
number of nearest neighbors to consider when making a prediction. Then, given a new
data point, the algorithm finds the K nearest neighbors in the training set based on
the distance metric used. Finally, the algorithm makes a prediction based on the most
common class or the average value of the K nearest neighbors.
The distance metric used in KNN can vary depending on the problem and the data.
Euclidean distance is commonly used for continuous data, while Manhattan distance or
Hamming distance can be used for categorical data. Cosine similarity discussed in 1.1.2
is also a commonly used distance metric in KNN.
KNN is a simple algorithm that does not require training, making it a popular choice
for many applications. However, its performance can be impacted by the choice of K
and the distance metric used. In addition, it can become computationally expensive as
the number of training examples increases.
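The following minimal sketch illustrates the prediction step described above, using Euclidean distance and majority voting; the toy points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(query, points, labels, k=3):
    # Euclidean distance between the query and every labeled training point.
    dist = lambda p: math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p)))
    nearest = sorted(range(len(points)), key=lambda i: dist(points[i]))[:k]
    # Classification: majority vote among the K nearest neighbors.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
print(knn_predict([1.1, 0.9], points, labels, k=3))  # "relevant"
```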
1.1.4. Overview of neural network
In recent years, neural networks have gained increasing popularity and have become
a prominent technique in machine learning. Neural networks are a type of machine
learning model that is designed to mimic the human brain’s structure and function.
They are made up of layers of interconnected nodes, and each node is responsible for
processing input data and producing an output. Neural networks are widely used in a
variety of fields, including computer vision, natural language processing, speech recog-
nition, and many others.
The basic building block of a neural network is the perceptron, which is a single
node that performs a simple calculation on its inputs. A perceptron takes in a vector
of input values and multiplies them by a corresponding set of weights, which are then
summed together with a bias term. The result is then passed through an activation
function to produce the output.
A common activation function used in neural networks is the sigmoid function, which
maps the output to a value between 0 and 1. Another popular activation function is
the rectified linear unit (ReLU) function, which outputs the input value if it is positive
and 0 otherwise.
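As a minimal sketch of the computation just described, a single perceptron with a sigmoid or ReLU activation can be written as follows; the input and weight values are arbitrary examples.

```python
import numpy as np

def perceptron(x: np.ndarray, w: np.ndarray, b: float, activation: str = "sigmoid") -> float:
    # Weighted sum of the inputs plus a bias term.
    z = np.dot(w, x) + b
    if activation == "relu":
        return max(0.0, z)           # ReLU: pass positive values, clamp negatives to 0
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid: squash the output into (0, 1)

x = np.array([0.5, -1.2, 3.0])   # input features
w = np.array([0.8, 0.1, -0.4])   # weights (learned during training)
print(perceptron(x, w, b=0.2))
```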
A neural network is typically made up of multiple layers of perceptrons (MLP),
shown in Figure 1.1, with each layer processing the output from the previous layer.
The input layer of a neural network consists of the raw input data, such as pixel values
in an image or word embeddings in natural language processing. The output layer
produces the final prediction, which could be a classification label, a probability value,
or a regression value.
Training a neural network involves adjusting the weights and biases of the percep-
trons to minimize the difference between the predicted output and the actual output.
This is typically done using a loss function, which calculates the difference between the
predicted output and the ground truth. The backpropagation algorithm is then used to
update the weights and biases of the perceptrons in the network to minimize the loss
function.
In summary, neural networks are a powerful tool for solving complex machine-
learning problems. They are widely used in various fields and can be trained to perform
tasks such as classification, regression, and prediction. Understanding the basics of neu-
ral networks is essential for building and designing effective machine learning models.
Figure 1.1: Neural network architecture
Figure 1.2: Basic graph neural network illustration
1.1.5. Overview of graph neural network
Graph Neural Networks (GNNs) are a type of neural network designed to work on
graph-structured data, where the input data consists of nodes and their relationships
with each other. GNNs have become increasingly popular in recent years, especially in
the field of recommendation systems.
The key idea behind GNNs is to learn node representations by aggregating infor-
mation from neighboring nodes, which is illustrated in Figure 1.2. This is achieved
through a series of message-passing operations, where each node in the graph updates
its representation based on the representations of its neighbors. The information is then
propagated throughout the graph, with each node refining its representation based on
the information received from its neighbors.
Message passing and aggregation are key operations in Graph Neural Networks
(GNNs). In a GNN, each node in the graph is associated with a feature vector, and the
relationships between nodes are represented by edges between them. Message passing
refers to the process by which information is propagated across these edges in the graph.
At each node, the features of its neighboring nodes are aggregated and combined with
their own features to form a new representation of the node.
The aggregation process involves taking into account all the information coming
from a node’s neighbors and summarizing it in a way that can be used to update the
node’s own features. There are various ways to perform aggregation, including mean
pooling, max pooling, and attention-based aggregation, among others. The choice of
aggregation method can have a significant impact on the performance of the GNN.
At each layer l, for each node v, the node updates its representation based on the
messages received from its neighbors:

m_{uv}^{(l)} = f_{\mathrm{message}}\big(h_v^{(l-1)}, h_u^{(l-1)}, e_{uv}^{(l-1)}\big)
m_v^{(l)} = \sum_{u \in \mathcal{N}(v)} m_{uv}^{(l)}
h_v^{(l)} = f_{\mathrm{update}}\big(h_v^{(l-1)}, m_v^{(l)}\big) \qquad (1.1.5)

where:
h_v^{(l)} is the hidden representation of node v at layer l.
e_{uv} is the edge feature from node u to node v.
\mathcal{N}(v) is the set of neighbors of node v.
f_{\mathrm{message}} is a function that computes the message from node u to node v.
f_{\mathrm{update}} is a function that updates the hidden representation of node v based on
the aggregated messages.
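The following minimal sketch implements one message-passing layer in the spirit of Equation 1.1.5, using sum aggregation. The specific choices of f_message (a linear transform of the neighbor state) and f_update (a ReLU over the concatenated previous state and aggregated message) are illustrative assumptions, not the exact functions of the models used later in this thesis.

```python
import numpy as np

def gnn_layer(h, edges, w_msg, w_upd):
    """One message-passing layer (Equation 1.1.5) with sum aggregation over neighbors.

    h:     dict node -> feature vector h_v^{(l-1)}
    edges: list of directed (u, v) pairs
    w_msg: weight matrix of the (assumed linear) message function
    w_upd: weight matrix of the (assumed) update function
    """
    # 1) Message: each neighbor u sends a transformed copy of its state to v.
    messages = {v: np.zeros(w_msg.shape[0]) for v in h}
    for u, v in edges:
        messages[v] += w_msg @ h[u]          # sum aggregation over N(v)
    # 2) Update: combine the node's previous state with its aggregated message.
    new_h = {}
    for v in h:
        combined = np.concatenate([h[v], messages[v]])
        new_h[v] = np.maximum(0.0, w_upd @ combined)   # ReLU non-linearity
    return new_h

# Tiny bipartite user-job graph: user 0 interacted with jobs 1 and 2.
h = {0: np.ones(4), 1: np.full(4, 0.5), 2: np.full(4, 0.25)}
edges = [(1, 0), (2, 0), (0, 1), (0, 2)]
rng = np.random.default_rng(0)
out = gnn_layer(h, edges, rng.normal(size=(4, 4)), rng.normal(size=(4, 8)))
```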
GNNs have been shown to be effective in a variety of tasks, such as node clas-
sification, link prediction, and recommendation systems. They can handle large and
complex graphs, as well as graphs with varying sizes and structures. In addition, GNNs
can incorporate additional features or attributes of the nodes and edges, making them
a flexible tool for graph-based data analysis.
However, GNNs also pose some challenges. One of the main challenges is the issue
of over-smoothing, where the representations of nodes become too similar after multiple
iterations of message passing, leading to a loss of information. Another challenge is
the difficulty of training GNNs on large-scale graphs, as the computation and memory
requirements can be high.
1.2. Overview of recommendation system
Recommendation systems or recommender systems are an essential part of modern
e-commerce, social media, and other online platforms. A recommender system has two
main entities, users who interact with the system and items which are products, such
as songs, books, videos, movies, products, articles,... or other users that interacted
with users. These systems use algorithms to provide personalized recommendations to
users based on their interests, behavior, and preferences for the sake of improving user
engagement and satisfaction in online platforms, turning potential customers into real
customers, and increasing the operating performance and revenue of the system.
In general, there are three basic types of recommendation systems: collaborative
filtering, content-based filtering, and hybrid recommendation systems. Two other basic
recommendation approaches: item popularity and user-item matching are also intro-
duced. These recommendation methods are clarified in the following subsections.
1.2.1. Item popularity recommendation
Item popularity is a straightforward yet effective approach that can be used to
provide users with job recommendations based on the assumption that popular items
are more likely to be relevant to a larger number of users. Item popularity refers to
the degree to which an item is favored by users. By recommending popular items, the
system can provide users with opportunities that have already proven to be attractive
to others, increasing the likelihood of a successful match.
Measuring item popularity in a recommendation system can be done using various
metrics, depending on the domain. Some common metrics include:
Number of views: The total number of times an item has been viewed by users
can be an indicator of its popularity.
Number of purchases or downloads: The total number of purchases or downloads
of an item can also be used to gauge its popularity.
Positive feedback: User ratings, reviews, or other forms of positive feedback can
be used to measure the appeal of an item.
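As a minimal sketch of this idea, a popularity score can be obtained by counting weighted interactions per item and recommending the top-scoring items that the target user has not yet interacted with. The interaction types and weights below are illustrative assumptions.

```python
from collections import Counter

# (user_id, item_id, interaction_type) tuples; the per-type weights are assumed for illustration.
interactions = [(1, "job_a", "click"), (2, "job_a", "apply"), (3, "job_b", "click"),
                (1, "job_c", "click"), (2, "job_c", "click"), (3, "job_c", "apply")]
weights = {"click": 1.0, "apply": 3.0}

# Popularity score: weighted count of interactions per item.
popularity = Counter()
for user, item, itype in interactions:
    popularity[item] += weights[itype]

def recommend_popular(user_id, k=2):
    # Recommend the most popular items the user has not interacted with yet.
    seen = {item for u, item, _ in interactions if u == user_id}
    ranked = [item for item, _ in popularity.most_common() if item not in seen]
    return ranked[:k]

print(recommend_popular(3))  # most popular unseen items for user 3
```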
Item popularity offers several advantages as a recommendation technique. It is rel-
atively easy to implement and requires minimal computational resources compared to
more complex techniques. Item popularity can be used to mitigate the cold-start prob-
lem by recommending items to new users who have not yet provided enough information
for more personalized recommendations. Also, popular items are likely to be relevant
to a larger number of users, increasing the chances of user satisfaction.
Despite its advantages, item popularity also has some limitations. Item popularity
does not take into account individual user preferences, and therefore does not provide
personalized recommendations. Also, relying solely on item popularity can lead to a biased
system that only recommends the most popular items, potentially overlooking less pop-
ular but more suitable opportunities for individual users.
To address the limitations of item popularity, hybrid approaches combining item
popularity with other techniques, such as content-based filtering or collaborative filter-
ing, can provide a more personalized and effective recommendation experience for users.
Contextual information such as user demographics or location can also be taken into
account to help tailor popularity-based recommendations to specific user segments.
1.2.2. User-item matching recommendation
The user-item matching recommendation represents users and items as feature vec-
tors that capture their properties and characteristics. These feature vectors can be
created by analyzing the textual, visual, or other content associated with the users and
items.
Once the feature vectors are obtained, the recommendation process involves comput-
ing the similarity between the feature vector of the target users and the feature vector
of the target items to identify how similar they are to each other. To compare the user’s
and the item’s representation vectors, a similarity measure such as cosine similarity can
be used. Once the similarity between the user and the item is determined, the most
similar items are then recommended to the target users.
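A minimal sketch of this matching step is shown below: user and job profiles are turned into TF-IDF vectors over a shared vocabulary and ranked by cosine similarity. The profile texts are made-up examples and scikit-learn is used here only for illustration; the actual feature construction of the thesis is described in Chapter 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

user_profile = ["java developer with spring and sql experience"]
job_postings = ["senior java spring engineer",
                "python data scientist",
                "sql database administrator"]

vectorizer = TfidfVectorizer()
job_vectors = vectorizer.fit_transform(job_postings)   # item feature vectors
user_vector = vectorizer.transform(user_profile)       # user feature vector

scores = cosine_similarity(user_vector, job_vectors)[0]  # similarity to every job
ranking = scores.argsort()[::-1]                          # most similar job first
for idx in ranking:
    print(job_postings[idx], round(float(scores[idx]), 3))
```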
The user-item matching recommendation offers several advantages. By comparing
the user and item representation vectors, the system can provide personalized recom-
mendations based on the specific preferences and characteristics of each user. It does not
require information about interactions between users and items, making them suitable
for cold-start scenarios.
Despite its advantages, the user-item matching recommendation also has some draw-
backs. This approach is limited by the availability of feature data for both users and
items. If the data is incomplete or biased, the recommendations may not be accurate.
1.2.3. Content-based recommendation system
Content-based recommendation systems [2] are a class of recommendation systems
that leverage the properties and features of items to recommend similar items to users.
These systems are based on the assumption that if a user has shown interest in a partic-
ular item in the past, then the user is likely to be interested in items that share similar
properties or features. Figure 1.3 demonstrates a basic content-based recommendation
system.
Similar to the user-item matching recommendation approach, the main idea behind
the content-based recommendation system is to represent item entities as feature vectors
that capture their properties and characteristics. These feature vectors can be created
by analyzing the textual, visual, or other content associated with the items.
Once the feature vectors are created, the recommendation process involves com-
puting the similarity such as cosine similarity between the feature vector of the user’s
Figure 1.3: Content-based recommendation system
preferred item and the feature vectors of other items in the system. The most similar
items are then recommended to the user.
Content-based recommendation systems have several advantages over other types
of recommendation systems. They do not require information about the preferences of
other users. Additionally, content-based recommendation systems can provide expla-
nations for the recommendations they make, which can help users understand why a
particular item is being recommended. However, content-based recommendation sys-
tems also have limitations. They rely on the quality and availability of item features,
which may not be available for all types of items. Additionally, they may suffer from
the problem of overspecialization, where users are only recommended items that are
similar to their past preferences, leading to limited exploration of new items.
Much research considers the user-item matching method to be a type of content-based
recommendation system because both utilize the user's and item's profiles for making
recommendations. Nevertheless, this thesis separates the two approaches because of
some differences between them.
1.2.4. Collaborative filtering recommendation system
Collaborative filtering [2] is one of the most widely used techniques in recommen-
dation systems. It is a type of algorithm that uses historical data, typically stored in
a user × item rating matrix, to make recommendations to users. There are two
main types of collaborative filtering: memory-based and model-based.
Memory-based collaborative filtering is a type of recommendation system that
utilizes users' historical behavior to predict their preferences for items. The similarity
between users (or items) is calculated using the K-nearest neighbors (KNN) algorithm,
which is discussed in 1.1.3.
Figure 1.4: User-based vs Item-based in memory-based collaborative filtering
The memory-based collaborative filtering approach
can be divided into two categories: user-based and item-based collaborative filtering.
In user-based collaborative filtering, the system identifies users who have similar
preferences to the target user and then recommends items that these similar users have
liked or rated highly. This approach assumes that users who have had similar preferences
in the past will have similar preferences in the future. For example, in Figure 1.4 (a),
John and Tim have similar behavior (both interacted with the second and fourth items)
so John can be recommended the first and last items, which were interacted with by Tim.
On the other hand, in item-based collaborative filtering, the system identifies items
that are similar to the items that the target user has liked or rated highly. The system
then recommends these similar items to the target user. This approach assumes that
users will like items that are similar to the items they have liked in the past. For
example, in Figure 1.4 (b), the first and fourth items are similar because they are both
interacted with by Tim and Amy, so the fourth item can be recommended to John since he
interacted with the first item.
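The following minimal sketch performs user-based collaborative filtering on a toy binary interaction matrix in the spirit of Figure 1.4: the most similar users are found with cosine similarity, and their interactions are used to score unseen items for the target user.

```python
import numpy as np

# Rows = users (John, Tim, Amy), columns = items; 1 = interacted, 0 = not.
R = np.array([[0, 1, 0, 1, 0],
              [1, 1, 0, 1, 1],
              [1, 0, 1, 0, 0]], dtype=float)

def user_based_scores(R, target, k=1):
    # Cosine similarity between the target user's row and every other user's row.
    norms = np.linalg.norm(R, axis=1)
    sims = (R @ R[target]) / (norms * norms[target] + 1e-12)
    sims[target] = -1.0                      # exclude the target user themself
    neighbors = np.argsort(sims)[::-1][:k]   # K most similar users
    # Score items by the similarity-weighted interactions of the neighbors.
    scores = sims[neighbors] @ R[neighbors]
    scores[R[target] > 0] = -np.inf          # do not re-recommend seen items
    return scores

print(np.argsort(user_based_scores(R, target=0))[::-1][:2])  # top-2 items for John
```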
One of the main advantages of memory-based collaborative filtering is its simplicity
and ease of implementation. It does not require any complex modeling or training
processes. Instead, it relies solely on the similarity between users or items. However,
memory-based collaborative filtering also has some limitations. One major issue is that
it can be computationally expensive when dealing with a large number of users and
items. Despite these limitations, memory-based collaborative filtering remains a popular
and effective approach for recommendation systems, especially for smaller datasets with
a relatively stable user-item interaction history.
Model-based collaborative filtering, also known as matrix factorization, is an
approach that uses a model to learn the latent factors that explain the ratings of users.
The idea behind this approach is that users and items are characterized by a set of
latent factors, which can be learned from the historical rating data. The learned factors
can then be used to predict the rating of a user for an item. This method is typically
implemented using matrix factorization techniques such as singular value decomposition
(SVD), non-negative matrix factorization (NMF), or alternating least squares (ALS).
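As a minimal sketch of the matrix-factorization idea, a truncated SVD of a small, dense toy rating matrix yields latent user and item factors whose product reconstructs (and fills in) the ratings. Real systems would typically use implicit-feedback variants such as ALS rather than plain SVD on a dense matrix.

```python
import numpy as np

# Toy user x item rating matrix with 0 marking unobserved entries.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# Truncated SVD with k latent factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
user_factors = U[:, :k] * np.sqrt(s[:k])       # latent representation of each user
item_factors = Vt[:k, :].T * np.sqrt(s[:k])    # latent representation of each item

R_hat = user_factors @ item_factors.T          # predicted scores, including unseen cells
print(np.round(R_hat, 2))
```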
In recent years, deep learning techniques such as neural networks have been applied
to collaborative filtering. These methods are known as deep collaborative filtering,
and they learn the representations of users and items directly from the data. Deep
collaborative filtering models can be divided into two categories: neural network-based
models and graph-based models.
Neural network-based models use feedforward neural networks or recurrent neural
networks to learn the representations of users and items. The inputs to the network
are the user and item features, and the output is the predicted rating or preference
score. These models can be trained using standard backpropagation algorithms and
loss functions such as mean squared error or cross-entropy.
Graph-based models represent users and items as nodes in a graph, where the edges
represent the interactions between users and items. These models use graph neural
networks (GNNs) to learn the representations of users and items and to make recom-
mendations. GNNs can capture complex interactions between users and items and can
model high-order dependencies in the data. Graph-based models have shown promising
results in several recommendation tasks, including item recommendation, social recom-
mendation, and sequential recommendation.
Model-based collaborative filtering has several advantages over memory-based col-
laborative filtering. It can handle sparse and large datasets, it can capture the non-
linear relationships between users and items, and it can incorporate additional features
or contextual information about the users and items. However, it also requires more
computational resources and may suffer from overfitting if the model is too complex or
the data is too noisy.
1.2.5. Hybrid recommendation system
Hybrid recommendation systems [4] combine multiple recommendation techniques
to address the limitations of individual methods and provide more accurate and di-
verse recommendations. Hybrid systems can be designed using various combinations of
collaborative filtering, content-based filtering, and other recommendation techniques.
There are several approaches to building hybrid recommendation systems:
Weighted: This approach combines the scores from different recommendation
methods by assigning weights to each method based on its performance. For
example, a hybrid system may combine collaborative filtering and content-based
filtering, with the weight assigned to each method based on the accuracy of the
recommendations.
Switching: This approach selects the recommendation method based on the char-
acteristics of the user or item being recommended. For example, a hybrid sys-
tem may use collaborative filtering for new users with little data, and switch to
content-based filtering once enough data is collected.
Feature combination: This approach combines the features used by different rec-
ommendation methods to create a hybrid feature space. For example, a hybrid
system may use collaborative filtering and content-based filtering, with the feature
space including both user-item interactions and item attributes.
Feature augmentation: This approach utilizes a contributing recommendation
model that is employed to generate a rating or classification of the user/item
profile. This result is further used in the main recommendation system to pro-
duce the final predicted result.
Cascade: This approach applies different recommendation methods in a sequence.
For example, a hybrid system may first use collaborative filtering to generate a
set of candidate items, and then use content-based filtering to refine the recom-
mendations.
Hybrid recommendation systems have been shown to outperform individual recom-
mendation methods in many cases, and are commonly used in industrial applications.
However, designing and implementing a hybrid system can be challenging, as it requires
expertise in multiple recommendation techniques and careful consideration of how the
methods are combined.
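As a minimal sketch of the weighted approach listed above, the scores of two recommenders can be normalized and blended with a tunable weight; the weight value and the score dictionaries below are illustrative assumptions.

```python
def weighted_hybrid(content_scores, collab_scores, alpha=0.6):
    """Blend two score dictionaries: alpha * content-based + (1 - alpha) * collaborative."""
    def normalize(scores):
        # Scale scores into [0, 1] so the two recommenders are comparable.
        hi = max(scores.values()) or 1.0
        return {item: s / hi for item, s in scores.items()}
    c, f = normalize(content_scores), normalize(collab_scores)
    items = set(c) | set(f)
    return {i: alpha * c.get(i, 0.0) + (1 - alpha) * f.get(i, 0.0) for i in items}

content_scores = {"job_a": 0.9, "job_b": 0.4, "job_c": 0.2}
collab_scores = {"job_b": 12.0, "job_c": 7.0, "job_d": 3.0}
blended = weighted_hybrid(content_scores, collab_scores)
print(sorted(blended, key=blended.get, reverse=True))  # final hybrid ranking
```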
1.3. Related work
Content-based approaches only use semantic measures between the user profile
and the set of available jobs to make recommendations. In a content-based recommen-
dation system, vector representations of job postings are created using Bag of Words
(BoW) with TF-IDF weighting [11]. According to [13], content-based recommendation
system contributions have been stable for the past 10 years although their results are
not significant.
Collaborative filtering approaches are based solely on behavioral data, often
stored in a user × item rating matrix. In the labor market context, this matrix
could be filled with click behavior: the rating matrix element equals 1 if the job
seeker (row) clicked on a vacancy (column) to view the job information, and 0 otherwise.
Memory-based collaborative filtering is created using a K-nearest neighbor (KNN)
method. In other words, for a given user u, one looks for K people who are similar to
u (user-based collaborative filtering) or K items that are similar to the entities u has
already interacted with (item-based collaborative filtering). In this case, the entries of
the rating matrix are always used to determine similarity [8].
Regression-based models are used in model-based collaborative filtering in an effort
to fill in the missing values in the rating matrix using only data from the rating matrix.
In recent years, deep neural networks have become common strategies, accounting for
approximately 50% of all contributions in 2019 and 2020 according to [13]. The increas-
ing usage of deep neural networks in recommendation systems is not limited to jobs
but holds for recommender systems in general.
Hybrid approaches in the literature fall into four main categories: cascade hybrids,
feature augmentation, weighted hybrids, and switching hybrids.
Cascade hybrids refine the recommendation given by another recommender system.
These commonly include (gradient) boosting models [15] with XGBoost being particu-
larly popular. Besides boosting, refinements using common content-based recommendation
or collaborative filtering methods have also been proposed.
Feature augmentation uses the result from the previous recommendation system as
an input for the next model [6].
In weighted hybrids, the output of the separate models is combined using some
(possibly non-linear) combination of the predicted scores. Commonly, content-based
recommendation and model-based collaborative filtering are in this way combined [12].
Collaborative filtering often suffers from the cold-start problem. Despite multiple
hybrid approaches to resolving this problem, the most direct one is to use switching
hybrids. This often implies that the recommender system uses collaborative filtering
by default. However, if an item or a user has insufficient data, the recommender
system switches to another recommendation algorithm such as a content-based approach
or user-item matching [16].
Validation when lacking interaction. Following [13], researchers have limited
access to interactions between job seekers and job postings, which naturally impacts
the type of methods and validation that is used.
In the matter of lack of interaction data, several strategies are proposed to still be
able to evaluate their recommendations:
Use one of the competition datasets for validation and training even after the
competition had finished.
Expert validation is also used frequently: the quality of recom-
mendations is examined by a group of ‘experts’, which may be the researchers
themselves, HR/recruitment experts, or sometimes students [9].
Use the previous N jobs in the experience section of a resume to predict the
(N + 1)-th job.
Chapter 2
Proposed approaches for job
recommendation
This chapter presents the overall model construction of the personalized job recom-
mendation system and explains the selected recommendation techniques, including item
popularity, user-item matching, content-based with item popularity, user-based collab-
orative filtering with item popularity, and graph neural networks with item popularity.
2.1. Overall model construction of the job recommendation system
The personalized job recommendation system aims to provide job seekers with a
list of jobs that best match their preferences. The system takes into account various
factors, such as job seekers’ profiles, job postings, and historical interactions between
job seekers and job postings. The overall model construction consists of the following
components:
Dataset description and analysis: The first step is to collect data sources and
analyze them. Two datasets - RecSys2016 and CareerBuilder2012, which to the
author's knowledge are the only suitable datasets for the job recommendation
system that the thesis could find - are collected from the ACM RecSys Challenge
2016 and Kaggle's Job Recommendation Challenge competitions, then described
and analyzed to understand their characteristics, such as the size of the datasets,
the types of features available, and any data quality issues. This analysis is
important to inform the data preprocessing steps and the selection of appropriate
recommendation techniques.
Data labeling: In this component, the users, items, and interactions in each
dataset are then split into a training set and a test set for model construction
and evaluation.
Data preprocessing: In this component, the datasets are preprocessed and pre-
pared for use in the recommendation model. This involves tasks such as data
cleaning, feature selection, normalization, and feature engineering. The goal of
data preprocessing is to transform the raw data into a format that can be used
by the recommendation model.
Recommendation model implementation: In this component, the job recommen-
dation system will be implemented using selected techniques such as item pop-
ularity, user-item matching, content-based with item popularity, user-based col-
laborative filtering with item popularity, and graph neural networks with item
popularity. The recommendation model will take the preprocessed dataset as
input and generate personalized job recommendations for each user.
The model will be evaluated based on the MAP@K and RSScore metrics to determine its
effectiveness in providing relevant job recommendations to users, which are presented in
the next chapter.
2.2. Dataset description and analysis
As mentioned in section 1.3, there is a lack of labor market data, especially interaction
data between job seekers and job postings, due to its high level of privacy. This makes
competitions play an important role in the job recommendation literature. Each of these
competitions shares a dataset, an objective, and an error measure for teams to enroll
and construct job recommendations for a set of hold-out users.
Besides being used for contributions to the competitions themselves, the datasets
are also commonly used to train and validate job recommendation systems even after
the contest ends and the test data from the contest is no longer available. Given
that approximately 32% of all job recommender system contributions use a dataset
originating from the competition, these datasets have significantly impacted the job
recommender literature [13].
In this thesis, RecSys2016 and CareerBuilder2012 are the two datasets used to build and
evaluate the job recommendation system: the 2012 CareerBuilder Kaggle competition
dataset [3] from the United States job board CareerBuilder and the RecSys 2016
competition dataset [1] from the German job board XING.
2.2.1. RecSys2016 dataset
The dataset is a semi-synthetic sample of XING data, in that it is enriched with
artificial users whose presence contributes to the anonymization. This dataset has the
following characteristics:
The dataset contains artificial users.
The dataset contains only a fraction of XING’s users and job postings.
IDs are used instead of raw text for almost all attribute values.
Some user fields may be removed or changed to NULL/unknown.
Not all interactions are included in the dataset.
Some interactions are artificial, not actually done by the user.
Timestamps have been changed but the order of the interactions remains the same.
The RecSys2016 dataset includes the four following main tables:
Table Users contains detailed information about the users that appear in the
dataset. As described above, this data has been refined for the purpose of anonymiza-
tion. Table 2.1 describes the data fields in the Users table.
Field name Type Description
id Int Anonymized ID of the user
jobroles Text Comma-separated list of job role terms (nu-
meric IDs) that were extracted from the
user’s current job title. 0 means that there
was no known job role detected for the user
career level Int Career level ID:
0: unknown
1: Student/Intern
2: Entry Level (Beginner)
3: Professional/Experienced
4: Manager (Manager/Supervisor)
5: Executive (VP, SVP, etc.)
6: Senior Executive (CEO, CFO, Pres-
ident)
discipline id Int Anonymized IDs represent disciplines such as
”Consulting”, ”HR”, etc
industry id Int Anonymized IDs represent industries such as
”Internet”, ”Automotive”, ”Finance”, etc
country Text The country in which the user is currently
working. de, at, ch, non_dach respectively
represent ”Germany”, ”Austria”, ”Switzer-
land”, and ”Other country”
region Int Region ID, specified only for users whose country is de
experience n entries class Int The number of CV entries that the user has
listed as work experiences. 0, 1, 2, and 3
respectively represent ”no entries”, ”1-2 en-
tries”, ”3-4 entries”, and ”5 or more entries”
experience years experience Int The estimated number of years of work ex-
perience that the user has:
0: unknown
1: less than 1 year
2: 1-3 years
3: 3-5 years
4: 5-10 years
5: 10-15 years
6: 16-20 years
7: more than 20 years
experience years in current Int The estimated number of years that the user
has been working in her current job. The
meaning of the numbers is the same as for
experience years experience
edu degree Int Estimated university degree of the user. 0, 1,
2, and 3 respectively represent ”unknown”,
”bachelor”, ”master”, ”phd”
edu fieldofstudies Text Comma-separated fields of studies that the
user studied. 0 entries mean ”unknown” and
bigger than 0 entries refer to broad fields of
studies such as ”Engineering”, ”Economics
and Legal”, ...
Table 2.1: Table Users in the RecSys2016 dataset
Table Items contains detailed information about job postings that have been and
will be suggested to various users. As described above, this data has been modified for
anonymization purposes. Table 2.2 describes the data fields in the Items table.
Table Interactions contains the interaction the user has made on the job posting.
There are four types of interactions:
1: The user clicked on the item
2: The user bookmarked the item
3: The user clicked on the reply button or application form button that is shown
on some job postings
4: The user deleted the recommendation from his/her list of recommendations
(clicking on ”x”) which has the effect that the recommendation will no longer
be shown to the user and that a new recommendation item will be loaded and
displayed to the user
Interactions 1, 2, and 3 are considered positive interactions (the job posting is suitable
to suggest to the user). Interaction type 4 is considered a negative interaction (the job
posting is not suitable to suggest to the user). Information about these interactions is described in
Table 2.3.
Table Impressions contains information about which jobs have been shown to which
users by XING’s existing recommendation system during which week of the year. There
are web, mobile, and email impressions. Impressions are not guaranteed to be on the
user’s screen. Information about these impressions is described in Table 2.4.
RecSys2016 dataset exploratory data analysis
This section conducts some analysis for more understanding of the characteristics of
the RecSys2016 dataset. Table 2.5 shows the dataset statistics.
The results show that 95% of the events are impressions and the probability of
interacting with these events is about 4%. Each item is interacted with 8.57 times on
average, which is quite a low number. Based on these statistics, 3 categories of users
can be distinguished:
New Users: The user only has information in the Users table but has not yet been
shown any job postings. This category accounts for about 12% of the users.
Inactive Users: These are users who have information in the Users table and are
shown job postings but do not have any interaction. This category accounts for
about 16% of the users.
Active Users: These are users who have information in the Users table, are shown
job postings, and interact with them. This category accounts for about 72% of
the users.
In a similar way, the items can be divided into 2 different categories:
New Item: Job postings with information in the Items table but have not par-
ticipated in any events. This category accounts for about 25% of the number of
job postings.
Old Item: Job postings that have information in the Items table and have par-
ticipated in at least 1 event type. This category accounts for about 75% of the
job postings.
Figure 2.1: User’s university degree distribution in the RecSys2016 dataset
Figure 2.1 shows the distribution of the users’ university degrees. The ”unknown”
value dominates the others, which indicates that users’ degree information is not trivial
to collect or publish. Surprisingly, the ”master” degree is the most popular and accounts
for 61.3% of the total number of non-unknown degrees. The ”bachelor” and ”phd”
degrees account for 29.0% and 9.7% respectively. This shows that this is a group of
highly educated users, most of whom have received post-graduate training.
The distribution of the users’ total experience years is shown in Figure 2.2. The
diagram shows that most users have total years of experience greater than or equal to
5 (93% of the total number of non-unknown experience years), which indicates that this
is a senior user group.
Figure 2.3 shows the distribution of the user’s experience years in his/her current
job. Most users have less than 3 years of experience at their current job (53.3% of non-
unknown experience years), 3-10 years of current experience account for 34.5%, and the
share drops to 12.2% for more than 10 years of experience at the current job.
The distribution of the user’s number of experience entries is described in Figure
2.4. The number of experience entries in a resume that is greater than or equal to 5 is
in the majority (48.9%) compared to ”1-2 entries” (25.6%) and ”3-4 entries” (25.5%).
This can be explained quite clearly because this user group is senior, most of them
have a large number of total experience years (5-20 years), and the number of years of
experience at the current job is small (mostly smaller than or equal to 3 years).
Figure 2.5 shows the distribution of job employment types. ”full-time” dominates
the rest, showing that employers using the XING platform are mostly looking for
candidates who will work for the organization long-term rather than in part-time or
other positions.
Figure 2.2: User’s total experience year distribution in the RecSys2016 dataset
Figure 2.3: User’s current experience year distribution in the RecSys2016 dataset
Figure 2.4: User’s number of experience entry distribution in the RecSys2016 dataset
Figure 2.5: Job’s employment type distribution in the RecSys2016 dataset
Figure 2.6: Job’s active during test status distribution in the RecSys2016 dataset
Figure 2.7: User and job’s career level distribution in the RecSys2016 dataset
Figure 2.8: Top 5 user and job’s discipline id distribution in the RecSys2016 dataset
The distribution of the users’ and jobs’ career levels is shown in Figure 2.7. Although
the number of users is approximately equal to the number of items, there is a gap
between the users and the jobs. The ”Experienced” career level dominates the others
in both the user and item categories: 1,070,631 experienced job positions account for
79.17% of the total number of jobs and 375,100 experienced users account for 47.50%
of the total number of non-unknown users, yet there is still a big gap between these two
entities. Another observation is that although most users have many years of experience
(the number of users having at least 5 years of experience accounts for 87.91% of the
total number of non-unknown users according to Figure 2.2), most users are only at the
experienced level, which may be because users spend just a few years in one position
(according to Figure 2.3).
Figure 2.8 shows the top 5 users’ and jobs’ discipline id distribution. The gap
between users and items can clearly be noticed. The number of jobs with discipline id 0
is the majority, but the number of users with this discipline id is almost zero, so these
jobs cannot be matched. In contrast, the discipline ids of the users are mainly 15 and 16,
but there are few jobs with the same discipline ids.
The distribution of the users’ and jobs’ industry ids is described in Figure 2.9. The gap
between users’ and items’ industry ids can clearly be noticed. The number of users in
industry ids 15 and 11 accounts for the majority, but the number of jobs in the same
industry ids is very small and cannot meet the demand. In contrast, jobs mostly have
industry id 0, but there are almost no users corresponding to that industry id.
Figure 2.10 shows the distribution of the user and job’s country. It can be seen that
the distribution of users and jobs is quite similar, and can basically satisfy each other.
Figure 2.9: Top 5 user and job’s industry id distribution in the RecSys2016 dataset
Figure 2.10: User and job’s country distribution in the RecSys2016 dataset
Figure 2.11: Interaction type distribution in the RecSys2016 dataset
Most users and jobs are in Germany, since XING is a German platform.
The distribution of interaction types is shown in Figure 2.11. 81.4% of the interactions
are ”click”, which is reasonable because this kind of interaction is easily performed. The
11.5% share of ”delete” shows that the level of user engagement with the current
recommender system is not high.
RecSys2016 dataset problems
The data released for the challenge is taken directly from the XING portal and has
been anonymized for privacy reasons. This characteristic allows a realistic simulation
of the job recommendation system, but it leads to some problems:
Presence of incomplete data. During the data analysis, it was noticed that
some fields in the Users and Items tables had NULL values. This is a common
occurrence in real-world data as not all fields are mandatory and may not be filled
in.
Presence of abnormal data. During the data analysis, it was noticeable that
there were some interactions and impressions that did not have any connection
with their respective user or item in the corresponding tables.
Absence of the test set. The absence of a test set for calculating the score of
the recommendation system was identified as an issue from the beginning of this
chapter. This situation is potentially critical as it does not provide any verification
of the implemented recommendation system’s correctness.
Figure 2.12: CareerBuilder2012 data layout
2.2.2. CareerBuilder2012 dataset
The dataset contains data about users, job postings, and user applications for job
postings on the CareerBuilder.com website. These applications span a total of 13 weeks
and are divided into 7 groups, with each group corresponding to a period of 13 days.
Each of these time windows is further divided into two parts: the first 9 days constitute
the training period, while the last 4 days are the evaluation period. This division is
shown in Figure 2.12.
In the CareerBuilder.com dataset, each user and job posting is randomly assigned to
only one time window. The probability of a job posting being assigned to a
time window is proportional to the amount of time it has been present on the website
during that window. On the other hand, the probability of a user being assigned to a
time window is proportional to the number of job applications they have made during
that window. For example, if User 1 applies only to jobs in Time Window 1, they are
assigned to Time Window 1 with 100% probability. However, User 2 applied for jobs in
both Time Window 1 and Time Window 2, so they could have been assigned to either
Time Window 1 or 2.
During each time window, the dataset includes job postings that were applied to
by users within that time window during the 9-day training period. The users who
made 5 or more applications during the 4-day evaluation period are considered as the
evaluation set, while the remaining users are considered as the training set.
The problem at hand is to predict, for each time window, which job(s) from that
time window the evaluation users applied for during the 4-day evaluation period.
It is important to note that while users may have applied for jobs from other windows,
the focus of the problem is solely on the jobs within the user’s own window. Therefore,
the goal is to develop a recommendation system that can accurately predict the job(s)
that users will apply for in their respective time windows based on their behavior during
the 9-day training period.
The CareerBuilder2012 dataset includes the five following main tables:
Table users, described in Table 2.6, contains information about users, where each
row describes a user.
The table user_history, described in Table 2.7, contains information about the
user’s work history, where each row describes a job the user has done.
The table jobs, described in Table 2.8, contains information about job postings,
where each row describes a job posting.
The apps table, described in Table 2.9, contains information about the user’s ap-
plications for the job, where each row describes an application.
The table window_dates, described in Table 2.10, contains information about time
windows, where each row describes a time window.
CareerBuilder2012 dataset exploratory data analysis
This section conducts some analysis for a better understanding of the characteristics
of the CareerBuilder2012 dataset. There are 389,708 users, 1,091,923 jobs, and 1,603,111
applications in the dataset.
Users can be distinguished into 2 categories:
Active users: These are users who have information in the Users table and have
applications in Apps table. This category accounts for 82.43% of the users.
Inactive users: These are users who have information in the Users table but do
not have applications in Apps table. This might include new users. This category
accounts for 17.57% of the users.
Regarding jobs, only 365,649 jobs (33.49% of all jobs) have been applied to. Each
job is applied to 4.38 times on average, which is a low number.
The distribution of the users’ degree types is shown in Figure 2.13. ”Bachelor’s” and
”High School” form the majority, respectively accounting for 35.99% and 32.22% of
the non-None degree types. ”PhD” accounts for the least, only 0.014%. The degree
types here are more diverse than in the RecSys2016 dataset, which also reveals that
the users in the CareerBuilder2012 dataset have a lower education level.
Figure 2.14 and 2.15 respectively list the top 20 majors and jobs of the users. As
expected, the majority of both users’ majors and jobs are business relevant. ”Busi-
ness”, ”Accounting”, ”Marketing”, ”Finance”, and ”Management” are well-studied by
users and they also align with the users’ jobs, which mostly are ”Customer Service”,
”Sales”, and ”Manager”. Besides that, ”Psychology”, ”Criminal Justice”, ”Nursing”,
Figure 2.13: User’s degree type distribution in the CareerBuilder dataset
Figure 2.14: Top 20 user’s major distribution in the CareerBuilder dataset
Figure 2.15: Top 20 user’s job distribution in the CareerBuilder dataset
Figure 2.16: User’s years from graduation distribution in the CareerBuilder dataset
”Computer Science”, and ”Education” majors are quite popular but do not appear among
the top 20 users’ jobs, except that ”Medical Assistant” in 13th place is associated with
”Nursing”.
The distributions of the users’ years from graduation and total experience years are
shown in Figures 2.16 and 2.17. Since the largest total experience value of a user is
112 years, which is obviously noise, and values larger than 40 are negligible compared
to the rest, the figure only shows total experience years smaller than or equal to 40.
There are negative values in years from graduation, which indicate candidates who have
not yet graduated. The two figures are somewhat aligned with each other since most
of the users in the dataset are newly graduated and also have few years of experience.
Figure 2.18 shows that 42.2% of users were not working for any organization at the
time. This is quite a large number, since Figure 2.17 indicates that most users had some
experience. This means that the users might have quit their job beforehand or simply
set the status to not currently employed for job-searching purposes.
One out of four experienced users has managed others according to Figure 2.19, which
is quite a high ratio.
Figure 2.20 lists the top 20 job titles of the items. This is highly consistent with the
top 20 users’ jobs in Figure 2.15, in which business-related jobs such as ”Customer
Service”, ”Sales”, and ”Manager” occupy most of the list.
99.7% of users and 99.9% of jobs are located in the US. Following Figure 2.21, the
ratio of jobs to users across each US state is quite similar and approximately equal to
the overall ratio of total jobs to total users, which is approximately 2.8.
Figure 2.17: User’s total experience years (≤ 40) distribution in the CareerBuilder
dataset
Figure 2.18: User’s CurrentlyEmployed status distribution in the CareerBuilder dataset
Figure 2.19: User’s ManagedOthers status distribution in the CareerBuilder dataset
Figure 2.20: Top 20 item’s job titles in the CareerBuilder dataset
Figure 2.21: User and item’s US state distribution in the CareerBuilder dataset
CareerBuilder2012 dataset problems
The data released for the challenge is taken from the CareerBuilder portal. Similar
to RecSys2016, this dataset also has some problems, as follows:
Lack of negative samples. The data contains only applications between users
and jobs, which are considered positive samples.
Hard-to-extract information from textual data. The data contains a large
amount of textual data but lacks information extracted from it, such as skills,
salary, etc.
Presence of incomplete data. During the data analysis, it was noticeable that
some fields in the Users and Items tables had NULL values. This is a common
occurrence in real-world data as not all fields are mandatory and may not be filled
in. To handle this, a default value was chosen for each field to replace the NULL
value.
Presence of abnormal data. During the data analysis, there were some appli-
cations that did not have any connection with their respective user or item in the
corresponding tables. These instances were deleted during development as it was
not possible to infer any useful information about the user or item they referred
to.
Absence of the test set. The absence of a test set for calculating the score of
the recommendation system was identified as an issue from the beginning of this
chapter. This situation is potentially critical as it does not provide any verification
of the implemented recommendation system’s correctness.
2.3. Data labeling
In the RecSys2016 dataset, the interaction time is from 19/08/2015 to 08/11/2015.
First, the last week of the data is treated as the hold-out test set by
using 02/11/2015 as the cut-off time. The remaining data is the training set.
Since the ”active_during_test” field in the Items table indicates whether a job is still
available and recommendable to users, items whose ”active_during_test” value is False
are removed from the test set, and the recommendations in this thesis only include
jobs whose ”active_during_test” field is True.
Figure 2.22: Item’s active status distribution in the CareerBuilder dataset
The Impressions table contains the actual items that had been recommended to
users, but it could not be used for testing purposes since it might bias the results.
Finally, the RecSys2016 data contains:
training set: 744,337 unique users, 971,343 unique items, and 8,121,088 interactions.
test set: 118,755 unique users, 105,077 unique items, and 486,981 interactions, in which
83,845 users and 72,375 items have interactions in the training set. The rest are
considered new users and items.
For the CareerBuilder2012 dataset, the application time spans from 01/04/2012 to
26/06/2012.
As above, 20/06/2012 is chosen as the cut-off time to split the last week’s data into
a hold-out test set. The remaining 12 weeks of data are treated as the training set.
Since items whose ”EndDate” field is earlier than the cut-off time are no longer
available and recommendable to test-set users after the cut-off time, these items are
removed from the test set. The jobs whose ”EndDate” field is later than the cut-off
time are treated as ”Active”; their distribution is shown in Figure 2.22.
Finally, the CareerBuilder data contains:
training set: 301,116 unique users, 352,836 unique items, and 1,494,914 interactions.
test set: 35,315 unique users, 34,755 unique items, and 108,162 interactions, in
which 15,200 users and 21,944 items have interactions in the training set. The rest
are considered new users and items.
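To make this split concrete, the following is a minimal sketch of the time-based hold-out split for RecSys2016, assuming pandas DataFrames whose column names follow Tables 2.2 and 2.3 (the loading code and exact column spellings are assumptions for illustration):

import pandas as pd

def time_split(interactions: pd.DataFrame, items: pd.DataFrame, cutoff: str = "2015-11-02"):
    """Split interactions into train/test by timestamp and keep only still-active test items."""
    cut = pd.Timestamp(cutoff).timestamp()                      # cut-off as a Unix timestamp
    train = interactions[interactions["created_at"] < cut]
    test = interactions[interactions["created_at"] >= cut]
    # Only jobs that are still recommendable during the test period stay in the test set.
    active_ids = items.loc[items["active_during_test"] == 1, "id"]
    test = test[test["item_id"].isin(active_ids)]
    return train, test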
2.4. Data preprocessing
In the RecSys2016 dataset, 3 tables are used, including the Users table, Items table,
and the Interactions table. As mentioned in subsection 2.2.1, there is a small fraction
of NULL values, and the rows containing them are removed from these tables; the
interactions whose users or items are not in the Users and Items tables are also erased.
The interaction types ”click” (1), ”bookmark” (2), ”reply” (3), and ”delete” (4) intuitively
represent the level of interest of the user in specific job postings. The ”delete” interaction
type is mapped from 4 to 0 so that the interaction types are in the correct order in terms
of level of interest.
Anonymized textual fields within each table (the ”jobroles” field in the Users table; the
”title” and ”tags” fields in the Items table) then have their commas replaced by spaces
and are concatenated into a new ”text_feature” field for further embedding. An example
of this is shown in Table 2.11.
For the CareerBuilder2012 dataset, the Users, User_history, Jobs, and Apps tables
are used. As mentioned in subsection 2.2.2, there are some NULL values among the
fields in these tables, which are replaced with a default value chosen for each field:
an empty string for textual or categorical fields and the mean value for numerical
fields. The applications whose users or items are not in the Users and Jobs tables are
also deleted.
Each user may have many historical jobs in the User_history table. First, these jobs
are aggregated so that each user has a list of jobs separated by spaces. Then, the
User_history table is joined with the Users table using the UserID field as the key.
Again, textual fields in the Users and Jobs tables are concatenated:
”Major” and ”JobTitle” fields in the Users table are concatenated into a new
”text_feature” field.
”Title”, ”Description”, and ”Requirements” fields in the Jobs table are concatenated
into a new ”text_feature” field.
The textual data is dirty, with HTML tags, arbitrary capitalization, numbers, special
symbols, etc. This data type is cleaned by the following process:
Remove HTML tags
Convert to lowercase
Remove punctuation
Remove stop words (common words in a language that are frequently used but carry
little or no meaning on their own, such as ”the”, ”and”, ”a”, ”an”, ”in”, ...)
Remove numbers
Lemmatize, i.e., reduce each word to its base or dictionary form, called a lemma.
For example, lemmatization would reduce the words ”am”, ”are”, and ”is” to their
shared lemma ”be”.
For example, this is a ”Description” value from the Items table before cleaning:
<div style="text-align: center"><span style="text-decoration:
underline"><strong><em>Assistant Managers and General
Managers</em></strong></span><span style="text-decoration:
underline"> <br>
</span><strong></strong></div>
<p
align="center"><strong>Milo&rsquo;s Hamburgers</strong> has been
in business since 1947 and has 15 location in the Birmingham area.
Milo&rsquo;s is one of Birmingham&rsquo;s treasures. They&nbsp;have
been a&nbsp;fabric of the community for 64 years and have the
most passionate and loyal customers in the restaurant business.
</p>
<div>&nbsp;</div>
<p align="center"><strong>Milo&rsquo;s</strong>
employees are offered opportunities to develop and grow in a company
that is very well established. Our benefit package ranks at the
top of the industry with competitive salary programs, cost affective
medical insurance programs and 401K company matching plan are just a
few our benefits.</p>
After the data cleaning process, it becomes:
assistant manager general manager r rmilo’s hamburger business
since location birmingham area milo’s one birmingham’s treasure
fabric community year passionate loyal customer restaurant business
r rmilo’s employee offered opportunity develop grow company well
established benefit package rank top industry competitive salary
program cost affective medical insurance program company matching
plan benefit
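The cleaning steps above can be sketched as follows; the choice of BeautifulSoup and NLTK is an assumption for illustration, not necessarily the exact tooling used in the thesis:

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Requires one-time downloads: nltk.download("stopwords"); nltk.download("wordnet")

def clean_text(raw: str) -> str:
    """Clean one raw job-description string following the steps listed above (a sketch)."""
    text = BeautifulSoup(raw, "html.parser").get_text(" ")   # remove HTML tags
    text = text.lower()                                       # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                     # drop punctuation and numbers
    lemmatizer = WordNetLemmatizer()
    stops = set(stopwords.words("english"))
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in stops]
    return " ".join(tokens)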
2.5. Recommendation model implementation
2.5.1. Item popularity approach
This is a baseline approach. This model is not personalized - it simply recommends to
a user the most popular items that the user has not previously consumed. As popularity
accounts for the ”wisdom of the crowds”, it usually provides good recommendations,
generally interesting for most people.
With the RecSys2016 dataset, users are allowed to view a job many times and can
interact with it in different ways (e.g. ”click”, ”bookmark”, ”reply”, and ”delete”).
Thus, the popularity score is obtained with the following steps:
For a given job, to model the interest of one user, all the interactions the user has
performed with that item are aggregated by summing the interaction type strengths
Figure 2.23: Popularity score calculation illustration
and then applying the log transformation in Equation 2.5.1 to smooth the distribution,
which helps manage the impact of too many interactions between a particular
user-item pair.
The popularity of the given job equals the sum of all users’ interest in that job.
An example of the popularity score calculation for 2 items and 3 users interacting
with them is described in Figure 2.23.

$x \mapsto \log_2(1 + x)$ (2.5.1)
In the CareerBuilder2012 dataset, a similar method is applied to calculate the popularity
score of an item, but the smoothing function is not used since a user applies for a
particular job only once in this dataset. Every target user in both datasets is then
recommended the top k jobs with the highest popularity scores.
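A minimal sketch of this popularity computation, assuming the interactions are available as (user, item, interaction-type strength) tuples (the data-access details are assumptions):

import numpy as np
from collections import defaultdict

def popularity_scores(interactions):
    """Popularity of each job: sum over users of log2(1 + total interaction strength)."""
    per_pair = defaultdict(float)
    for user, item, strength in interactions:
        per_pair[(user, item)] += strength           # aggregate one user's interest in one job
    scores = defaultdict(float)
    for (user, item), total in per_pair.items():
        scores[item] += np.log2(1 + total)           # smooth with Equation 2.5.1
    return scores

def recommend_popular(scores, seen_items, k=30):
    """Top-k most popular jobs the target user has not consumed yet."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [i for i in ranked if i not in seen_items][:k]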
2.5.2. User-item matching approach
Since users and items are compared directly with each other, they must be embedded
in the same vector space; the same feature set must therefore be used for both users
and items. In both datasets, the data is divided into 2 types:
textual data: the ”text_feature” field created in both datasets during data
preprocessing (section 2.4).
non-textual data:
In RecSys2016, the data can be further divided into categorical data (”discipline_id”,
”industry_id”, ”region”, and ”country” fields) and numerical data (the
”career_level” field).
In CareerBuilder2012, there is only categorical data (”City”, ”State”, ”Country”).
To get the representation of the non-textual data, the categorical data is embedded
using a OneHotEncoder, concatenated with the numerical data, and finally normalized
with a MinMaxScaler, since the one-hot vectors and numerical values are not yet on
the same scale. For the textual data, TF-IDF is applied to create 20000-dimensional
vectors.
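A sketch of this feature construction with scikit-learn; the key point is that the encoders are fitted on users and items together so both end up in the same vector spaces (variable names and the data wrangling are assumptions):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer

def build_feature_vectors(cat_users, cat_items, num_users, num_items, text_users, text_items):
    """Embed users and items into shared non-textual and textual vector spaces."""
    n_users = len(cat_users)
    # Fit the one-hot encoder on users and items together so both share one vector space.
    cats = np.vstack([cat_users, cat_items])
    onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(cats).toarray()
    non_textual = np.hstack([onehot, np.vstack([num_users, num_items])])
    non_textual = MinMaxScaler().fit_transform(non_textual)    # bring both parts to one scale
    ntv_u, ntv_i = non_textual[:n_users], non_textual[n_users:]

    tfidf = TfidfVectorizer(max_features=20000).fit(list(text_users) + list(text_items))
    tv_u, tv_i = tfidf.transform(text_users), tfidf.transform(text_items)
    return ntv_u, ntv_i, tv_u, tv_i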
The similarity between a user and a job is then calculated using a weighted combination
of 2 cosine similarities:

$overall\_similarity = w_1 \times cosine(ntv_u, ntv_i) + w_2 \times cosine(tv_u, tv_i)$ (2.5.2)

in which:
$w_1$ and $w_2$ are the weights of each partial similarity.
$ntv_u$ and $ntv_i$ are respectively the vector representations of the non-textual data of users and items.
$tv_u$ and $tv_i$ are respectively the vector representations of the textual data of users and items.
$(w_1, w_2)$ is then experimented from (0.0, 1.0) to (1.0, 0.0) with a step size of 0.1.
Finally, every target user is recommended the top k jobs having the highest
overall_similarity with the user.
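Given the vectors above, the weighted similarity and top-k selection can be sketched as follows (function and variable names are illustrative):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def overall_similarity(ntv_u, ntv_i, tv_u, tv_i, w1=0.1, w2=0.9):
    """Weighted combination of non-textual and textual cosine similarities (Equation 2.5.2)."""
    return w1 * cosine_similarity(ntv_u, ntv_i) + w2 * cosine_similarity(tv_u, tv_i)

def top_k_items(similarity_matrix, k=30):
    """Indices of the k highest-scoring items for every user (one row per user)."""
    return np.argsort(-similarity_matrix, axis=1)[:, :k]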
As mentioned in section 1.2, the item popularity approach and the user-item matching
approach can be used to cope with the cold-start problem since they are able to
recommend items to any target users, even new ones with no interactions. For that
reason, these methods are suitable to be integrated with other approaches to form
a hybrid recommendation approach. The three following recommendation approaches
suffer from the cold-start problem and can only make recommendations for users who
have interacted in the past, so a switching hybrid strategy utilizing the same item
popularity approach is used in all three to allow a fair comparison between the three
primary methods, as sketched below.
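The switching logic itself is simple; a minimal sketch (names are illustrative, not the thesis implementation):

def switching_hybrid(user, primary_recs, popularity_recs, k=30):
    """Use the primary model's list when it exists for this user, else fall back to popularity."""
    recs = primary_recs.get(user)
    return recs[:k] if recs else popularity_recs[:k]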
2.5.3. Content-based with item popularity approach
Similar to the user-item matching approach, the content-based recommendation system
calculates the similarity between entities, in this case between items. The
overall_similarity calculation is exactly as in the matching approach. The only
difference is that only the items’ data is used:
In the RecSys2016 dataset, one additional categorical field (”employment”) in the
Items table is used.
In the CareerBuilder2012 dataset, the data fields used remain the same.
Equation 2.5.2 is used and $(w_1, w_2)$ is also experimented from (0.0, 1.0) to (1.0, 0.0)
with a step size of 0.1.
The content-based method requires the target users to have interactions in the past
in order to make recommendations. For that reason, the target users who have
interacted are recommended the items that are most similar to the last item they
positively interacted with (not ”delete”). The new users who do not have any positive
interactions are recommended the item popularity approach’s results through the
switching hybrid strategy.
Figure 2.24: Simple collaborative filtering illustration
2.5.4. Collaborative filtering with item popularity approach
This approach is based solely on interactions between users and items. First, an
interaction matrix between users and items is created for both datasets. An element
of the matrix (the interaction type) is in (0, 1, 2, 3) for RecSys2016 and in (0, 1) for
CareerBuilder2012. Each user is then embedded as a row of that matrix, so the vector’s
length equals the number of items in the system.
A KNN model is used with K = 5000 nearest neighbors and cosine similarity as the
metric between user entities. After obtaining the K nearest neighbors of a target user,
the target user’s prediction vector is computed as the mean of its neighbors’
corresponding rows in the interaction matrix. The target user can then be recommended
the items with the highest values in the prediction vector.
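A sketch of this neighbourhood-based prediction with scikit-learn; the interaction matrix may be a SciPy sparse matrix, and parameter names are assumptions:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_cf_predict(interaction_matrix, target_index, k=5000):
    """Predict a target user's scores as the mean of its k nearest neighbours' rows."""
    knn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(interaction_matrix)
    _, indices = knn.kneighbors(interaction_matrix[target_index:target_index + 1])
    neighbours = [i for i in indices[0] if i != target_index][:k]     # drop the user itself
    prediction = np.asarray(interaction_matrix[neighbours].mean(axis=0)).ravel()
    return prediction   # recommend the unseen items with the highest predicted values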
A simple collaborative filtering approach is described in Figure 2.24. Here the system
contains 6 users ($u_1$ ... $u_6$) and 5 items ($i_1$ ... $i_5$). Each user is represented as
a row of the interaction matrix, with unseen interactions replaced by a default value
(0 in this case). For example, user $u_3$'s embedding is (1, 2, 0, 0, 0). From these
vectors, KNN can be applied to get the nearest neighbors of the target user. With
$k = 2$, assume that $u_1$ and $u_5$ are the $k$ nearest neighbors of user $u_3$. The prediction
vector of $u_3$ is then obtained by averaging its $k = 2$ nearest neighbors, which results in
(1, 1, 2, 1.5, 0). From here, the unseen interactions between user $u_3$ and items $i_3$, $i_4$
can be predicted so that recommendations can be made.
Collaborative filtering also suffers from the cold-start problem which requires the
target users to have interactions in the past to make recommendations. Because of
this, with the new users who do not have any positive interactions, a switching hybrid
strategy is used to recommend these users with the item popularity approach’s results.
Figure 2.25: Bipartite graph demonstration in recommendation systems
2.5.5. Graph Neural Network with item popularity approach
The recommendation system can be considered a link prediction problem in Graph
Neural Networks (GNN) where the GNN model learns to predict the likelihood of the
existence of an edge between two nodes in a graph based on their features and the
graph structure. Link prediction is trained by comparing the score of an edge (positive
example) against a non-existent edge (negative example). For example, given an edge
connecting $u$ and $v$, the score between nodes $u$ and $v$ is encouraged to be higher than the
score between node $u$ and a sampled node $v'$ drawn from a distribution $P_n(v)$. This technique
is called negative sampling.
To model user-item relationships, a bipartite graph and a Relational Graph Convolutional
Network (RGCN) model are utilized. A bipartite graph is a type of graph that
consists of two sets of nodes, where nodes in one set are connected to nodes in the other
set but not to other nodes within the same set, as demonstrated in Figure 2.25. In the
context of job recommendation systems, one set typically represents the users, while the
other set represents the jobs. Edges between the two sets indicate a user-item interaction,
and the weight of an edge may indicate the strength of the interaction. An RGCN model
is a type of Graph Convolutional Network that is designed to work with heterogeneous
graphs: a separate convolution module with its own weights is used for each relation type.
In a Graph Convolutional Network, the message passing and aggregation equation
in 1.1.5 becomes:

$h_v^{(l)} = f^{(l)}\left( W^{(l)} \cdot \sum_{u \in N(v)} \frac{h_u^{(l-1)}}{|N(v)|} + B^{(l)} \cdot h_v^{(l-1)} \right)$ (2.5.3)

in which:
$h_v^{(l)}$ is the hidden representation of node $v$ at layer $l$.
$h_u^{(l-1)}$ is the hidden representation of node $u$, a neighbor of node $v$, at layer $l-1$.
$N(v)$ is the set of neighbors of node $v$.
$W^{(l)}$ and $B^{(l)}$ are learnable parameters.
$f^{(l)}$ is an activation function.
For each layer $l$, the function $f^{(l)}$ and the matrices $W^{(l)}$ and $B^{(l)}$ are shared across all nodes.
The heterogeneous graphs constructed from the two datasets have the same node and
edge types but different sizes:
RecSys2016:
Graph(
num_nodes={’item’: 967688, ’user’: 743410},
num_edges={
(’item’, ’interacted_by’, ’user’): 8090622,
(’user’, ’interacted’, ’item’): 8090622
},
metagraph=[
(’item’, ’user’, ’interacted_by’),
(’user’, ’item’, ’interacted’)
]
)
CareerBuilder2012:
Graph(
num_nodes={’item’: 352836, ’user’: 301116},
num_edges={
(’item’, ’interacted_by’, ’user’): 1494914,
(’user’, ’interacted’, ’item’): 1494914
},
metagraph=[
(’item’, ’user’, ’interacted_by’),
(’user’, ’item’, ’interacted’)
]
)
The architecture of RGCN for both datasets is as follows:
Model(
(rgcn): StochasticTwoLayerRGCN(
(conv1): HeteroGraphConv(
(mods): ModuleDict(
(interacted_by): GraphConv(in=300, out=256, normalization=right)
(interacted): GraphConv(in=300, out=256, normalization=right)
)
)
(conv2): HeteroGraphConv(
(mods): ModuleDict(
(interacted_by): GraphConv(in=256, out=128, normalization=right)
(interacted): GraphConv(in=256, out=128, normalization=right)
)
)
)
(pred): ScorePredictor()
)
in which:
The StochasticTwoLayerRGCN module is a two-layer heterogeneous graph convo-
lutional network. The module consists of two HeteroGraphConv layers: conv1 and
conv2. Each HeteroGraphConv layer is a dictionary mapping from the edge types
to GraphConv layers, with input and output sizes specified by the hidden_feat
and out_feat parameters. In the given implementation, the GraphConv layers
are initialized with ”right” normalization, which computes the row-normalized
adjacency matrix.
The ScorePredictor module takes as input an edge subgraph and node features
and outputs a score for each edge in the subgraph by computing the dot product
of the node features of the source and destination nodes of each edge.
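A sketch of this architecture in DGL and PyTorch, closely following the printout above; the training loop, mini-batch sampling, and hyperparameters are omitted and assumed:

import dgl
import dgl.nn as dglnn
import torch.nn as nn

class StochasticTwoLayerRGCN(nn.Module):
    def __init__(self, in_feat, hidden_feat, out_feat, rel_names):
        super().__init__()
        # One GraphConv per relation type, combined by HeteroGraphConv.
        self.conv1 = dglnn.HeteroGraphConv(
            {rel: dglnn.GraphConv(in_feat, hidden_feat, norm="right") for rel in rel_names})
        self.conv2 = dglnn.HeteroGraphConv(
            {rel: dglnn.GraphConv(hidden_feat, out_feat, norm="right") for rel in rel_names})

    def forward(self, blocks, x):
        x = self.conv1(blocks[0], x)
        x = self.conv2(blocks[1], x)
        return x

class ScorePredictor(nn.Module):
    def forward(self, edge_subgraph, x):
        # Score each edge by the dot product of its endpoint embeddings.
        with edge_subgraph.local_scope():
            edge_subgraph.ndata["h"] = x
            for etype in edge_subgraph.canonical_etypes:
                edge_subgraph.apply_edges(dgl.function.u_dot_v("h", "h", "score"), etype=etype)
            return edge_subgraph.edata["score"]

# Example instantiation matching the printed sizes (300 -> 256 -> 128):
# model = StochasticTwoLayerRGCN(300, 256, 128, ["interacted", "interacted_by"])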
300-dimensional initial feature vectors of user and item nodes are obtained with 2
different approaches:
In RecSys2016, the textual data is anonymized to IDs. TFIDF is used to embed these
values into 20000-dimensional vectors, and SVD is then used to further project these
vectors into a 300-dimensional space (see the sketch after this list). TFIDF with
”max_features” equal to 300 was not used directly because too much information might
be lost.
In CareerBuilder2012, FastText is used to embed the textual data into a 300-dimensional
space. FastText is an extension of the word2vec algorithm for learning word embeddings.
It uses a skip-gram model, which predicts the surrounding words of a given word, to
learn efficient word embeddings. FastText also considers character n-grams in addition
to words, which helps to learn embeddings for rare words and out-of-vocabulary words.
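A sketch of the RecSys2016 branch referenced in the list above, assuming texts holds one ”text_feature” string per node; the exact vectorizer settings are assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def build_node_features(texts):
    """20000-dimensional TF-IDF vectors reduced to 300 dimensions with truncated SVD."""
    tfidf = TfidfVectorizer(max_features=20000).fit_transform(texts)
    svd = TruncatedSVD(n_components=300, random_state=0)
    return svd.fit_transform(tfidf)        # one dense 300-dimensional vector per node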
In the RecSys2016 dataset, the number of ”delete” interactions is small, and they are not
used for training. In both datasets, ”uniform” negative sampling is used to randomly
sample nodes from the graph, with equal probability for each node.
Hinge loss is chosen as the loss function of this GNN model; it encourages the scores of
positive samples to exceed those of negative samples by a margin. A PyTorch
implementation of this loss looks like the following:
def compute_loss(pos_score, neg_score):
    # Hinge loss with margin 1: penalize negative edges scored within 1 of positive edges.
    n = pos_score.shape[0]
    loss = (neg_score.view(n, -1) - pos_score.view(n, -1) + 1).clamp(min=0).mean()
    return loss
The RGCN model is then trained for 500 epochs on both datasets. The node (user
and item) embeddings are obtained from the trained GNN. To get recommendations,
the dot product between the user embedding and each item embedding is calculated as
a score for each item, and the items with the highest scores are recommended.
The GNN approach can only make recommendations for users and items that exist in
the graph. Thus, the target users who have interacted are recommended using the GNN
approach. For the new users who do not have any interactions, the same switching
hybrid strategy as in the content-based and collaborative filtering approaches is used,
recommending the item popularity method’s results, since the fallback method must be
the same across the hybrid approaches for a fair comparison of the main methods.
Across all models, the recommendation output is a dictionary with target users
as keys and the corresponding lists of recommended items as values. In the Career-
Builder2012 dataset, this result is further refined by removing items that the target
user has already interacted with, since a user in this dataset can only apply for a job
once. On the other hand, the recommendation result is kept unchanged in the RecSys2016
dataset, as users in this dataset can interact with an item multiple times.
Table 2.2: Table Items in the RecSys2016 dataset
Field name Type Description
id Int Anonymized ID of the item
title Text Concepts (numeric IDs) that have been ex-
tracted from the job title of the job posting
career level Int Career level ID. The meaning of the numbers is the
same as career level in the Users table
discipline id Int Anonymized IDs represent disciplines such as
”Consulting”, ”HR”, etc.
industry id Int Anonymized IDs represent industries such as
”Internet”, ”Automotive”, ”Finance”, etc.
country Text Code of the country in which the job is offered
region Int Region ID, specified for some jobs whose country is de
latitude Float latitude information (rounded to ca. 10km)
longitude Float longitude information (rounded to ca. 10km)
employment Int The type of employment:
0: unknown
1: full-time
2: part-time
3: freelancer
4: intern
5: voluntary
tags Text Concepts that have been extracted from the
tags, skills, or company name
created at Int A Unix timestamp representing the time when the
job posting was created
active during test Int 1 if the item is still active (= recommendable)
during the test period and 0 if the item is not
active anymore in the test period (= not recom-
mendable)
Table 2.3: Table Interactions in the RecSys2016 dataset
Field name Type Description
user id Int ID of the user who performed the interaction
item id Int ID of the item on which the interaction was per-
formed
interaction type Int The type of interaction that was performed on
the item
created at Int A Unix timestamp representing the time when the
interaction was created
Table 2.4: Table Impressions in the RecSys2016 dataset
Field name Type Description
user id Int ID of the user
items Int A comma-separated list of items that were dis-
played to the user
year Text Year the job posting was displayed to the user
week Int Week of the year that job posting was displayed
to the user
Table 2.5: RecSys2016 dataset statistics
Event type #events #users #items
Click 7,183,038 769,396 998,424
Bookmark 206,191 59,063 142,908
Reply 422,026 107,463 190,099
Delete 1,015,423 44,595 215,844
Impression 201,872,093 2,755,167 846,814
- - 1,500,000 1,358,098
Table 2.6: Table users in CareerBuilder2012 dataset
Field name Type Description
UserID Int ID of the user
WindowID Int ID of the window time
Split String User belongs to training or test set
City String User’s City
State String User’s State
Country String User’s Country
ZipCode Int User’s postal code
DegreeType String User’s Degree Type
Major String User’s Major
GraduationDate Datetime User’s Graduation Date
WorkHistoryCount Int Amount of work the user has done
TotalYearsExperience Int Total years of user experience
CurrentlyEmployed Boolean Is the user currently working?
ManagedOthers Boolean Is the user managing others
ManagedHowMany Int Number of people being managed by the user
Table 2.7: Table user history in CareerBuilder2012 dataset
Field name Type Description
UserID Int ID of the user
WindowID Int ID of the window time
Split String User belongs to training or test set
Sequence Int Order of work done by a user, smaller order in-
dicates more recent work
JobTitle String Title of the job
Table 2.8: Table jobs in CareerBuilder2012 dataset
Field name Type Description
JobID Int ID of the job
WindowID Int ID of the window time
Title String Title of the job
Description String Job description
Requirements String Job requirements
City String City of job postings
State String State of job postings
Country String Country of job posting
Zip5 Int Postal code of job posting
StartDate Datetime Time the job posting starts showing on CareerBuilder.com
EndDate Datetime Time the job posting no longer shows on CareerBuilder.com
Table 2.9: Table apps in CareerBuilder2012 dataset
Field name Type Description
UserID Int ID of the user
WindowID Int ID of the window time
Split String User belongs to training or test set
ApplicationDate Datetime Time the user applied for the corresponding job
advertisement
JobID Int ID of the job
Table 2.10: Table window dates in CareerBuilder2012 dataset
Field name Type Description
Window Int ID of the window time
Train Start Datetime The start time of the training phase
Train End / Test Start Datetime Time to end the training period and start the
test period
Test End Datetime Test period end time
Table 2.11: Anonymized textual fields processing example
title: 4298526, 4316979
tags: 1471052, 2072458, 2512557
text_feature: 4298526 4316979 1471052 2072458 2512557
Chapter 3
Experiments and results
All the algorithms in this thesis are evaluated with MAP@k and the RecSys2016
Score (RSScore), which are described in the following subsections.
3.1. Evaluation metrics
3.1.1. Map@k
MAP@k (Mean Average Precision at k) is a metric used to evaluate the performance
of ranking algorithms, including information retrieval and recommendation systems. It
measures the average precision of the top-k-ranked items recommended by a system,
where k is a positive integer.
$MAP@k = \frac{1}{U} \sum_{u=1}^{U} AP_u@k = \frac{1}{U} \sum_{u=1}^{U} \frac{1}{\min(n, k)} \sum_{i=1}^{k} P_u@i \times rel_u(i)$ (3.1.1)
The MAP@k metric is calculated as in Equation 3.1.1, in which:
$U$ is the number of users.
$n$ is the number of relevant items of user $u$.
$P_u@i$ is the precision within the first top $i$ items recommended to user $u$.
$rel_u(i) = 1$ if the $i$-th item is relevant to user $u$, and $0$ otherwise.
MAP@k is a useful metric because it takes into account both the order and the
relevance of the recommended items, and it is sensitive to the number of relevant items
in the query. A higher MAP@k score indicates better performance of the ranking
algorithm.
In this thesis, MAP@1, MAP@5, MAP@10, MAP@30, and MAP@150 are applied.
MAP@k with k ≤ 10 is a reasonable metric since an interface screen can intuitively
contain up to 10 items for the user to interact with. MAP@150 was also used in the
Kaggle Job Recommendation Challenge competition [3]. This metric is used to measure
the performance of the model in recommending a large pool of potential items.
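A small sketch of this metric, matching Equation 3.1.1; the data structures are assumptions (per-user recommendation lists and sets of relevant items):

def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user: precision accumulated at each relevant position in the top k."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i                       # P_u@i counted only when item i is relevant
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(recs_per_user, relevant_per_user, k):
    """MAP@k: mean of AP@k over all target users."""
    users = list(recs_per_user)
    return sum(average_precision_at_k(recs_per_user[u], relevant_per_user.get(u, set()), k)
               for u in users) / len(users)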
3.1.2. RecSys2016 Score
The RecSys2016 challenge designed an evaluation measure that reflects typical use
cases at XING: users are presented with their top-k personalized recommendations, and
user interaction with one of the top-k is counted as a success. The task of the original
challenge is to compute 30 recommendations (or fewer) for each of the 150,000 target
users. In particular, the algorithms have to predict those items that a user will interact
with. The original evaluation measure of the ACM RecSys Challenge 2016 sums all the
individual users’ scores, as shown in Equation 3.1.2.
$RS2016EvalMeasure = \sum_{u=1}^{U} \left[ 20 \times (P_u@2 + P_u@4 + R_u@30 + Success_u@30) + 10 \times (P_u@6 + P_u@20) \right]$ (3.1.2)
in which:
$P_u@k$ is the precision within the first top $k$ items of user $u$.
$R_u@30$ is the recall over the 30 recommended items of user $u$.
$Success_u@30 = 1$ if user $u$ interacts with item(s) from the top 30 recommendations, and $0$ otherwise.
$U$ is the number of target users.
This is a comprehensive evaluation metric that aims at recommending the top 30 most
relevant items in the ACM RecSys Challenge 2016. However, the original task of the
challenge has a fixed set of 150,000 target users as the test set, which cannot be obtained.
For that reason, the original RecSys2016 evaluation measure is not directly suitable for
this thesis, since the numbers of target users in the test sets are different. Therefore, this
thesis further divides that scoring function by the total number of target users. The
RSScore equation is described in 3.1.3.
$RSScore = \frac{1}{U} \sum_{u=1}^{U} \left[ 20 \times (P_u@2 + P_u@4 + R_u@30 + Success_u@30) + 10 \times (P_u@6 + P_u@20) \right]$ (3.1.3)
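A sketch of the per-user RSScore computation following Equation 3.1.3, with the same assumed data structures as for MAP@k:

def rsscore(recs_per_user, relevant_per_user):
    """RSScore: the RecSys 2016 leaderboard measure averaged over the target users."""
    def precision_at(rec, rel, k):
        return sum(1 for x in rec[:k] if x in rel) / k

    total = 0.0
    for user, rec in recs_per_user.items():
        rel = relevant_per_user.get(user, set())
        recall_30 = (sum(1 for x in rec[:30] if x in rel) / len(rel)) if rel else 0.0
        success_30 = 1.0 if any(x in rel for x in rec[:30]) else 0.0
        total += 20 * (precision_at(rec, rel, 2) + precision_at(rec, rel, 4)
                       + recall_30 + success_30)
        total += 10 * (precision_at(rec, rel, 6) + precision_at(rec, rel, 20))
    return total / len(recs_per_user)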
3.2. Results and discussion
Before going into any results and discussion, it is important to note that the results
of the recommendation system are affected by the items shown to users,
Table 3.1: Performance of Matching model with different weights in the Recsys2016
dataset
(w1, w2) Map@1 Map@5 Map@10 Map@30 Map@150 RSScore
(0.0, 1.0) 0.0018 0.0018 0.0020 0.0024 0.0028 0.7715
(0.1, 0.9) 0.0018 0.0018 0.0020 0.0024 0.0028 0.7671
(0.2, 0.8) 0.0018 0.0018 0.0020 0.0024 0.0028 0.7677
(0.3, 0.7) 0.0017 0.0017 0.0019 0.0023 0.0026 0.7457
(0.4, 0.6) 0.0015 0.0014 0.0016 0.0019 0.0023 0.6551
(0.5, 0.5) 0.0011 0.0010 0.0012 0.0014 0.0016 0.4970
(0.6, 0.4) 0.0005 0.0005 0.0006 0.0007 0.0008 0.2715
(0.7, 0.3) 0.0002 0.0002 0.0002 0.0003 0.0003 0.1032
(0.8, 0.2) 0.0001 0.0001 0.0001 0.0001 0.0001 0.0376
(0.9, 0.1) 0.0001 0.0000 0.0000 0.0001 0.0001 0.0328
(1.0, 0.0) 0.0000 0.0000 0.0000 0.0000 0.0000 0.0067
which are called impressions since the users can only interact with a finite number of
impressions. If a job post never shows in the impressions of the system, it is not possible
to collect any interaction between the user and the post. In RecSys2016, impressions
are generated by the existing recommendation system. In CareerBuilder2012, there
is no information about how the impressions shown to users are generated. For that
reason, the platforms’ existing recommendation algorithms or baselines can influence the
performance of the recommendation models in this thesis.
The user-item matching approach’s performance with different weights on the
RecSys2016 and CareerBuilder2012 datasets is shown in Tables 3.1 and 3.2 respectively.
Surprisingly, in RecSys2016, even though seemingly informative non-textual data such
as ”discipline_id”, ”industry_id”, and ”career_level” is available besides geographical
data like ”region” and ”country”, the model constructed solely on textual data has the
best performance on all evaluation metrics. This might be due to the big gaps between
users and items in ”discipline_id”, ”industry_id”, and ”career_level” mentioned in the
RecSys2016 dataset analysis. In CareerBuilder2012, (w1, w2) = (0.5, 0.5) gives the
highest performance, which means both non-textual and textual data are important for
the recommendation, although the non-textual data in CareerBuilder2012 only contains
geographical data (”City”, ”State”, ”Country”) and none of the seemingly informative
fields of RecSys2016. The reason for the impact of geographical data on the performance
in this dataset might be that the existing algorithms in CareerBuilder that generate the
impressions also utilized this data type, especially since the competition suggested using
geographical data for its baseline approach.
Tables 3.3 and 3.4 show the content-based recommendation system’s evaluation
metrics with different weights. The model constructed using the RecSys2016 dataset
improves significantly compared to the user-item matching approach. The optimal (w1,
w2) pair is (0.1, 0.9), which is not so different from the one in the matching approach
and still relies mostly on textual data. This reveals that there might be a semantic
Table 3.2: Performance of Matching model with different weights in the Career-
Builder2012 dataset
(w1, w2) Map@1 Map@5 Map@10 Map@30 Map@150 RSScore
(0.0, 1.0) 0.0008 0.0008 0.0010 0.0012 0.0015 0.5014
(0.1, 0.9) 0.0023 0.0020 0.0023 0.0027 0.0032 0.9731
(0.2, 0.8) 0.0049 0.0047 0.0053 0.0062 0.0073 2.1135
(0.3, 0.7) 0.0091 0.0092 0.0103 0.0121 0.0137 3.7859
(0.4, 0.6) 0.0120 0.0126 0.0141 0.0162 0.0181 4.7051
(0.5, 0.5) 0.0131 0.0138 0.0153 0.0173 0.0193 4.7470
(0.6, 0.4) 0.0129 0.0130 0.0144 0.0163 0.0183 4.4659
(0.7, 0.3) 0.0127 0.0127 0.0141 0.0159 0.0178 4.2817
(0.8, 0.2) 0.0126 0.0126 0.0138 0.0156 0.0175 4.1917
(0.9, 0.1) 0.0126 0.0126 0.0139 0.0156 0.0176 4.1839
(1.0, 0.0) 0.0027 0.0028 0.0033 0.0041 0.0050 1.6564
gap as discussed in the Introduction between users’ profiles and jobs’ descriptions in
the dataset, which makes different entity type (user-item) comparisons less effective
than using one kind of entity type itself (item-item). This is reasonable since the
textual data in the RecSys2016 dataset is anonymized into ids so words with the same
meaning but in different forms are treated as totally distinct. On the other hand, in
the CareerBuilder2012 dataset, the performance of the model drops quite dramatically.
The optimal (w1, w2) pair is (0.6, 0.4), which does not change much from (0.5, 0.5) in
the matching model, and the data fields used also persist. The drop is because the
content-based method can only make recommendations for the 15,200/35,315 target
users who applied for at least one job in the past in the CareerBuilder2012 dataset. The
remaining 20,115 new target users are recommended with the item popularity algorithm,
which is not as effective as the matching approach in this case. The proportion of new
target users in the RecSys2016 dataset (29%) is much smaller than in the
CareerBuilder2012 dataset (57%), so it hurts the performance of the content-based
approach less.
The loss curves during a 500-epoch training of the RGCN model in the RecSys2016
and CareerBuilder2012 dataset are respectively shown in Figure 3.1 and 3.2. The curves
indicate that the two models have converged after 500 epochs.
In the RecSys2016 dataset, the content-based method, which is generally the best
model, is further combined with the matching method instead of item popularity. The
overall results of the different approaches are shown in Table 3.5. At first glance, no
single algorithm dominates the others on every evaluation metric. The item popularity
method is not personalized but still achieves better results than the matching one: the
matching algorithm has a better Map@1 than the item popularity approach, but
surprisingly the item popularity one is better on the other metrics. Since the item
popularity method is better than the matching one in general, which might be due to the
existing recommendation algorithms in the platform, it facilitates the corresponding
hybrid approach. That is why the hybrid approach using the item popularity method is
Figure 3.1: RGCN’s loss curve during training in RecSys2016
Figure 3.2: RGCN’s loss curve during training in CareerBuilder2012
Table 3.3: Performance of Content-based model with different weights in the Recsys2016
dataset
(w1, w2) Map@1 Map@5 Map@10 Map@30 Map@150 RSScore
(0.0, 1.0) 0.0539 0.0352 0.0346 0.0354 0.0360 4.5402
(0.1, 0.9) 0.0554 0.0356 0.0350 0.0358 0.0364 4.5414
(0.2, 0.8) 0.0552 0.0353 0.0348 0.0356 0.0362 4.4921
(0.3, 0.7) 0.0549 0.0349 0.0344 0.0351 0.0356 4.3887
(0.4, 0.6) 0.0543 0.0342 0.0337 0.0343 0.0347 4.1898
(0.5, 0.5) 0.0532 0.0331 0.0325 0.0330 0.0333 3.8986
(0.6, 0.4) 0.0515 0.0315 0.0308 0.0312 0.0315 3.5835
(0.7, 0.3) 0.0506 0.0304 0.0297 0.0301 0.0303 3.3722
(0.8, 0.2) 0.0503 0.0302 0.0295 0.0299 0.0301 3.3133
(0.9, 0.1) 0.0502 0.0301 0.0294 0.0298 0.0300 3.2988
(1.0, 0.0) 0.0028 0.0047 0.0052 0.0059 0.0062 1.5364
Table 3.4: Performance of Content-based model with different weights in the Career-
Builder2012 dataset
(w1, w2) Map@1 Map@5 Map@10 Map@30 Map@150 RSScore
(0.0, 1.0) 0.0029 0.0023 0.0024 0.0025 0.0027 0.4866
(0.1, 0.9) 0.0040 0.0033 0.0034 0.0037 0.0040 0.7775
(0.2, 0.8) 0.0049 0.0043 0.0045 0.0050 0.0055 1.1730
(0.3, 0.7) 0.0052 0.0048 0.0052 0.0058 0.0065 1.4730
(0.4, 0.6) 0.0056 0.0051 0.0056 0.0062 0.0069 1.5932
(0.5, 0.5) 0.0058 0.0054 0.0058 0.0064 0.0071 1.6388
(0.6, 0.4) 0.0061 0.0056 0.0061 0.0068 0.0075 1.7312
(0.7, 0.3) 0.0060 0.0056 0.0061 0.0068 0.0075 1.7314
(0.8, 0.2) 0.0060 0.0057 0.0062 0.0068 0.0076 1.7336
(0.9, 0.1) 0.0060 0.0057 0.0062 0.0068 0.0075 1.7258
(1.0, 0.0) 0.0007 0.0012 0.0014 0.0017 0.0021 0.6035
better than the one using the matching approach. The collaborative filtering algorithm
achieves the best RSScore = 6.7410, but the content-based one is the best on the other
evaluation metrics, which might indicate that collaborative filtering is better for
recommending a 30-item list in general. The graph neural network approach is slightly
better than the matching algorithm but unexpectedly falls behind every other model.
One common metric used to determine a graph’s sparseness is the edge density, defined
as the ratio of the number of edges in the graph to the maximum number of edges
possible. To determine the strength of connectivity between nodes within a graph, the
node degree, which is the number of edges connected to a node, is used. The reason the
GNN model performs poorly for this task might be that the graph is too sparse and the
connectivity between user and item nodes is too weak: the edge density of the graph is
only 0.0011%, which is extremely low, and the average user
Table 3.5: Performance of different models in the RecSys2016 dataset
Model Map@1 Map@5 Map@10 Map@30 Map@150 RSScore
Popularity 0.0001 0.0069 0.0074 0.0092 0.0098 2.7400
Matching 0.0018 0.0018 0.0020 0.0024 0.0028 0.7715
CB + Popularity 0.0554 0.0356 0.0350 0.0358 0.0364 4.5414
CF + Popularity 0.0327 0.0304 0.0317 0.0341 0.0355 6.7410
GNN + Popularity 0.0001 0.0024 0.0026 0.0033 0.0040 1.1577
CB + Matching 0.0561 0.0340 0.0333 0.0337 0.0342 3.9489
Table 3.6: Performance of different models in the CareerBuilder2012 dataset
Model Map@1 Map@5 Map@10 Map@30 Map@150 RSScore
Popularity 0.0001 0.0003 0.0003 0.0004 0.0005 0.1248
Matching 0.0131 0.0138 0.0153 0.0173 0.0193 4.7470
CB + Popularity 0.0061 0.0057 0.0062 0.0068 0.0075 1.7336
CF + Popularity 0.0168 0.0145 0.0153 0.0171 0.0184 3.9252
GNN + Popularity 0.0000 0.0000 0.0001 0.0001 0.0002 0.0381
CF + Matching 0.0262 0.0244 0.0262 0.0294 0.0318 7.0675
and item node degrees are 10.9 and 8.4, respectively, which is small. Overall, the content-based approach combined with item popularity is considered the best algorithm for the recommendation system in this dataset. The collaborative filtering technique also achieves good results on every metric and is only slightly behind the content-based one.
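To make the sparsity figures concrete, the sketch below computes the edge density and average node degrees of a bipartite user-item interaction graph from a plain list of (user, job) pairs; it is a minimal illustration assuming only user-item edges are counted, not the exact code used to produce the statistics reported above.

```python
def bipartite_graph_stats(interactions):
    """interactions: iterable of (user_id, job_id) pairs; duplicates are ignored.
    The graph is treated as bipartite, so the maximum possible number of edges
    is |users| * |jobs|."""
    edges = set(interactions)
    users = {u for u, _ in edges}
    jobs = {j for _, j in edges}
    n_edges = len(edges)
    density = n_edges / (len(users) * len(jobs))   # edge density
    avg_user_degree = n_edges / len(users)         # average user-node degree
    avg_job_degree = n_edges / len(jobs)           # average job-node degree
    return density, avg_user_degree, avg_job_degree

# Hypothetical mini-example: 3 users, 3 jobs, 4 interactions.
pairs = [(1, "a"), (1, "b"), (2, "a"), (3, "c")]
density, du, dj = bipartite_graph_stats(pairs)
print(f"edge density = {density:.2%}, avg user degree = {du:.1f}, avg job degree = {dj:.1f}")
```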
In the CareerBuilder2012 dataset, the collaborative filtering method, which is the best model, is further combined with the matching method instead of item popularity. Table 3.6 shows the overall results of the different methods on this dataset. Unlike in RecSys2016, item popularity performs poorly here, while the matching method is a significant improvement, which might be due to the influence of the recommendation algorithms already deployed on CareerBuilder. The content-based approach also performs worse than the matching one because the proportion of new target users is 57%, which is large and hurts the performance of the content-based method as discussed above. The graph neural network method performs worse on the CareerBuilder2012 dataset than on the RecSys2016 dataset and becomes the worst among these algorithms. This might be because the graph in this dataset is also sparse and the connectivity between the user and item nodes is even weaker: the edge density of the graph is 0.0014% and the average user and item node degrees are only 5.0 and 4.2, respectively. The hybrid technique based on collaborative filtering and matching dominates the others, including the hybrid technique based on collaborative filtering and item popularity, on every evaluation metric.
To sum up, the best-performing models in RecSys2016 and CareerBuilder2012 are, respectively, the hybrid model based on the content-based approach combined with the item popularity
method and the hybrid model based on collaborative filtering combined with the matching method. This aligns with the statement that "the performance of the different models differs considerably across datasets" [13], implying that the results are dataset dependent.
As mentioned, RecSys2016 and CareerBuilder2012 are the only publicly available datasets suitable for the job recommendation problem that the thesis could find. Although these competition datasets are commonly used after the competitions end to train and validate job recommendation systems when no other dataset is available, most studies in this field use their own private datasets. Moreover, the test sets of these datasets were only accessible during the contest periods and are restricted now, despite the author's attempts to obtain them. Most of the literature based on the RecSys2016 dataset was published in 2016 to serve the challenge and used the competition's test set and evaluation measure. For the CareerBuilder2012 dataset, the number of publications is small and they use different evaluation metrics. Despite the differences in test sets, comparisons between the thesis's results and the results from the original competitions, using the corresponding evaluation metrics, are still drawn.
On the RecSys2016 dataset, our collaborative filtering with item popularity approach reaches RSScore = 6.7410, which outperforms the top-1 score in the original 2016 challenge (681,707.38 over 150,000 target users, i.e., 681,707.38 / 150,000 ≈ 4.5447 per user, equivalent to RSScore = 4.5447). On CareerBuilder2012, the best result in the thesis, MAP@150 = 0.0318 using collaborative filtering combined with the matching approach, falls behind the top-1 scores for the public and private test sets in the original contest: MAP@150 = 0.1815 (approximately 7,000 target users) and MAP@150 = 0.1828 (about 16,000 users), respectively. It is worth noting that the differences in test sets may have contributed significantly to the variations in performance metrics across the studies, especially since the original test set in the CareerBuilder2012 challenge used a separate set of interactions collected over 13 weeks and split into 7 windows.
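For reference, a minimal MAP@K computation consistent with the usual definition of the metric is sketched below; the exact handling of ties and of users without relevant items in the competition evaluation scripts may differ from this sketch.

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@K for one user: the sum of precision@i over the ranks i (<= k) at which
    a relevant item appears, normalized by min(k, number of relevant items)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(k, len(relevant))

def map_at_k(all_recommended, all_relevant, k):
    """MAP@K: the mean of AP@K over all target users."""
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Hypothetical mini-example with two users (expected output: 0.375).
print(map_at_k([["a", "b", "c"], ["x", "y"]], [["b"], ["y", "z"]], k=5))
```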
Overall, classical recommendation approaches such as user-item matching, content-based filtering, and collaborative filtering still show their strength and practical applicability on these two datasets. Hybrid models built from these methods also play a crucial role in overcoming the limitations of each individual technique: the cold-start problem in collaborative filtering, where it is difficult to provide recommendations for new users, and the heavy dependence of content-based systems on item features, which may not capture the diversity of user preferences. By combining the approaches, a hybrid model can leverage the advantages of each technique and provide more accurate and diverse recommendations.
Conclusion and Future work
Conclusion
Currently, there has not been much research in Vietnam on the problem of job recommendation systems. Also, because of the lack of labor market datasets in general and test data in particular, this thesis could not comprehensively compare its results with those of other authors and mainly makes comparisons among the different algorithms implemented by the author. The main contributions of this thesis are as follows:
• Analyzing jobs and candidates' behavior and profiles from two labor market datasets, RecSys2016 and CareerBuilder2012, providing valuable insights for a better understanding of the job market.
• Implementing and experimenting with various job recommendation algorithms, such as item popularity, user-item matching, content-based filtering, user-based collaborative filtering, and a graph neural network, on the two datasets.
• Assessing the performance and practical usability of these models in the context of job recommendation systems using two metrics: Map@K and RSScore.
• The content-based approach combined with item popularity is generally the best model on the RecSys2016 dataset, with Map@1 = 0.0554, Map@5 = 0.0356, Map@10 = 0.0350, Map@30 = 0.0358, Map@150 = 0.0364, and RSScore = 4.5414. On the other hand, on the CareerBuilder2012 dataset, collaborative filtering combined with the matching method outperforms the others and achieves the highest evaluation metrics: Map@1 = 0.0262, Map@5 = 0.0244, Map@10 = 0.0262, Map@30 = 0.0294, Map@150 = 0.0318, and RSScore = 7.0675.
In addition to the contributions of the thesis, there are still some shortcomings, such as:
• The lack of an information extractor for the textual data in the CareerBuilder2012 dataset, which might provide some useful features.
• The lack of test data and of a comprehensive comparison with other research.
• Ethical concerns about privacy and data security, particularly if the system collects sensitive information from job seekers, were not considered in this thesis.
Part of the research in this thesis has been published at the 35th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA/AIE 2022) in Japan [5].
Future work
Finding a job is a complex process influenced by both explicit and implicit factors. Various recommendation algorithms were implemented, experimented with, and evaluated, with much potential for future improvement. Since building artificial intelligence models is an iterative process, this thesis suggests the following future development directions:
• Collecting and incorporating more data sources. Moreover, NLP models such as named entity recognition, skill extraction, and skill standardization might be involved, thereby enhancing the embedding representations in particular and the overall results of the recommendation system in general.
• Exploring more advanced machine learning models. In this thesis, classical recommendation models such as content-based and collaborative filtering were utilized. A graph neural network-based approach was also experimented with but did not reach the expected performance. This indicates that there is room for more advanced models, such as deep learning architectures, to be explored to improve the system's performance.
• Designing an end-to-end recommendation system architecture. Since a user often comes to the system with a CV/resume, the system can run an information extraction model beforehand to obtain useful features from the user's CV/resume and the job descriptions, without expecting the user to fill in a profile on the platform, which improves the user experience.
• Applying inductive learning to make recommendations for new, unseen users and items. Since the number of new entities appearing on an online job portal grows dramatically, inductive reasoning might help improve user engagement considerably.
Bibliography
[1] Fabian Abel et al. “Recsys challenge 2016: Job recommendations”. In: Proceedings
of the 10th ACM conference on recommender systems. 2016, pp. 425–426.
[2] Marko Balabanović and Yoav Shoham. “Fab: content-based, collaborative recommendation”. In: Communications of the ACM 40.3 (1997), pp. 66–72.
[3] Ben Hamner, Road Warrior, and Wojciech Krupa. Job Recommendation Challenge. 2012. url: https://kaggle.com/competitions/job-recommendation.
[4] Erion Çano and Maurizio Morisio. “Hybrid recommender systems: A systematic literature review”. In: Intelligent Data Analysis 21.6 (2017), pp. 1487–1524.
[5] Hai-Nam Cao et al. “Synonym Prediction for Vietnamese Occupational Skills”. In:
Advances and Trends in Artificial Intelligence. Theory and Practices in Artificial
Intelligence: 35th International Conference on Industrial, Engineering and Other
Applications of Applied Intelligent Systems, IEA/AIE 2022, Kitakyushu, Japan,
July 19–22, 2022, Proceedings. Springer. 2022, pp. 351–362.
[6] Cheng Guo et al. “How integration helps on cold-start recommendations”. In:
Proceedings of the Recommender Systems Challenge 2017. 2017, pp. 1–6.
[7] Ken Lazarus. Performance-Based Matching: Using Machine Learning to Quickly
Find Recruiters with Proven Success. Tech. rep. Scout Exchange, 2018.
[8] Kuan Liu et al. “Temporal learning and sequence modeling for a job recommender
system”. In: Proceedings of the Recommender Systems Challenge. 2016, pp. 1–4.
[9] Saket Maheshwary and Hemant Misra. “Matching resumes to jobs via deep siamese
network”. In: Companion Proceedings of the The Web Conference 2018. 2018,
pp. 87–88.
[10] Yoosof Mashayekhi et al. “A challenge-based survey of e-recruitment recommen-
dation systems”. In: arXiv preprint arXiv:2209.05112 (2022).
[11] Motebang Daniel Mpela and Tranos Zuva. “A mobile proximity job employment
recommender system”. In: 2020 International Conference on Artificial Intelli-
gence, Big Data, Computing and Data Communication Systems (icABCD). IEEE.
2020, pp. 1–6.
[12] Mirko Polato and Fabio Aiolli. “A preliminary study on a recommender system for
the job recommendation challenge”. In: Proceedings of the Recommender Systems
Challenge. 2016, pp. 1–4.
[13] Corné de Ruijt and Sandjai Bhulai. “Job recommender systems: A review”. In: arXiv preprint arXiv:2111.13576 (2021).
[14] Walid Shalaby et al. “Help me find a job: A graph-based approach for job rec-
ommendation at scale”. In: 2017 IEEE international conference on big data (big
data). IEEE. 2017, pp. 1544–1553.
[15] Wenming Xiao et al. “Job recommendation with hawkes process: an effective
solution for recsys challenge 2016”. In: Proceedings of the recommender systems
challenge. 2016, pp. 1–4.
[16] Chenrui Zhang and Xueqi Cheng. “An ensemble method for job recommender
systems”. In: Proceedings of the Recommender Systems Challenge. 2016, pp. 1–4.
[17] Justin Zobel and Alistair Moffat. “Inverted files for text search engines”. In: ACM
computing surveys (CSUR) 38.2 (2006), 6–es.